This paper presents a new synthesis approach for dedicated systems. The aim of our
synthesis scheme is to achieve an automatic exploration of VLIW processor architectures 
from a pure C description of the input system. The innovation consists in the fact that unit 
In this paper, we present an approach to hardware/software partitioning for real-time
embedded systems. The abstraction level we have adopted is referred to as the configuration 
level, where hardware is modeled as resources with no detailed functionality and software is 
modeled as tasks utilizing the resources. Through configuration-level analysis, cost and 
performance tradeoffs can be studied early in the design process and a large design space 
can be explored. Feasibility factor is introduced ... 
We introduce CASTLE, a design environment for embedded systems. Starting from an
algorithmic specification in C++/VHDL, CASTLE helps a designer to quickly find a suitable, 
cost-effective implementation of his system. The designer manually partitions the algorithmic 
specification into hardware and software components and refines the hardware architecture 
step by step. CASTLE provides immediate feed-back by displaying the feasibility and 
consequences of each partitioning decision. After partitions ... 
We present an experimental framework for mapping declarative programs, written in a
language known as Ruby, into various combinations of hardware and software. Strategies for 
parametrised partitioning into hardware and software can be captured concisely in this 
framework, and their validity can be checked using algebraic reasoning. The method has 
been used to guide the development of prototype compilers capable of producing, from a 
Ruby expression, a variety of implementations involving field-pr ... 
Design

This paper presents a methodology for hardware-software co-design. It is based on the 
formal description technique LOTOS in the specification phase, and on estimation methods at 
different levels of abstraction in the partitioning phase. The LOTOS specification describes the 
system as a set of interacting communicating processes. Our HW-SW partitioning algorithm is 
guided by communications, performance and area estimates and by the suitability of each 
process for implementation in hardware or sof ... 
An important presupposition for HW/SW partitioning are sophisticated estimation algorithms
at a high level of abstraction that obtain high quality results. Therefore the granularities of 
estimation and partitioning have to be adapted adequately. In this paper we discuss the 
effects that arise when the granularities of partitioning and estimation are not adapted in a 
necessary way. Furthermore we present our solution that allows to choose different levels of 
granularities adapted to the estimatio ... 
methods have been suggested. Thus the designer selects modules for HW or SW-
implementation for the best possible performance within a set of performance and design 
constraints. This paper describes an estimation method to approximate a priori the entire 
system performance. The estimation method has been integrated into the codesign tool COD 
and first results could be generated. The estimated speed-up has been d ... 
This paper presents the PACE partitioning algorithm which is used in the LYCOS co-synthesis
system for partitioning control/dataflow graphs into hardware- and software parts. The 
algorithm is a dynamic programming algorithm which solves both the problem of minimizing 
system execution time with a hardware area constraint and the problem of minimizing 
hardware area with a system execution time constraint. The target architecture consists of a 
single microprocessor and a single hardware chip (ASIC, ... 

Performance modeling and evaluation of embedded hardware/software systems is important 
to help the CoDesign process. The hardware/software partitioning needs to be evaluated 
before synthesizing the solution. This paper presents a co-simulation technique based on the 
use of an uninterpreted model able to accurately represent the behavior of the whole system. 
The performance model includes two complementary viewpoints: the structural viewpoint 
which describes the functional structure, the hardware ... 

Keywords: Hw/Sw systems, Performance evaluation, Co-Simulation, uninterpreted model 
Abstract

This paper presents a concept called hierarchically grouped mes- 
sage to improve the performance of geographically distributed timed 
cosimulation. In the proposed method, messages which are trans- 
ferred between simulators in a short period of simulated time are 
hierarchically grouped into a physical message to reduce the num- 
ber of rollbacks in optimistic simulation as well as the communica- 
tion overhead of message transfer. Experiments show the efficiency 
of the proposed method in an internationally distributed cosimula- 
tion environment 

1 Introduction

In geographically distributed cosimulation environments, designers 
can simulate a system which consists of various remotely located 
intellectual property (IP) blocks without requiring local copies of 
the IP blocks. IP providers and EDA vendors also have benefits 
of allowing their IP blocks and proprietary tools, e.g. high perfor- 
mance hardware emulators, to be accessed while protecting their 
intellectual property rights. 

However, high communication overhead in geographically dis- 
tributed cosimulation environments prevents designers from per- 
forming detailed timed cosimulation of communication intensive 
systems. The problem gets more serious when interrupt is used as 
the communication protocol in the system being designed, since 
hardware and software simulators should synchronize with each 
other (via slow communication over Internet) at every system clock 
tick to detect the occurrence of interrupt [1].

There has been little research on optimizing geographically
distributed timed cosimulation. As an optimization method, [2] [3] 
present a method, called selective focus, which dynamically changes 
the abstraction levels of communication models to allow designers 
to trade off between performance and accuracy. Contrary to [2] [3], 
we present an optimization method which preserves the accuracy 
of detailed cosimulation. 

In this paper, we focus on geographically distributed timed cosim- 
ulation of systems having interrupt as one of communication pro- 
tocols. By a geographically distributed cosimulation environment, 
we mean a network of workstations (or PCs) over Internet (or Wide 

*This work was supported in part by ETRI, Korea.
Area Network). Basically, our approach to the reduction of simulator
synchronization overhead (including communication overhead) is to
apply the optimistic simulation concept to geographically distributed
timed cosimulation, since optimistic simulation is advantageous
especially when communication overhead is dominant [1][4][5].

However, since communication overhead is excessive in geo- 
graphically distributed cosimulation environments, the performance 
gain obtained by applying conventional optimistic simulation meth- 
ods can be limited. In applying optimistic simulation to geographi- 
cally distributed timed cosimulation, the effects of such high com- 
munication overhead on the increase of cosimulation run-time are 
twofold: (1) excessive rollbacks, i.e. rollback overhead caused by 
the slow transfer of messages compared to the simulation execution 
as well as (2) the communication overhead itself. According to our 
experiments, optimistic simulation suffers from excessive rollbacks 
when intensive synchronization between simulators is performed in 
a short period of simulated time. It is because while messages are 
being transferred via slow communication over Internet, the opti- 
mistic simulator that is to receive the messages runs further into the 
future, which causes rollbacks in the receiving simulator. To re- 
duce such excessive rollbacks and high communication overhead, 
we present a concept called hierarchically grouped message (HM) 
where messages transferred between simulators in a short period of 
simulated time are hierarchically grouped into a physical message. 

This paper is organized as follows. In Section 2, we give a brief 
description of applying optimistic simulation to timed cosimula- 
tion. Section 3 explains our motivation. We present hierarchically 
grouped message concept in Section 4. We give experimental re- 
sults in Section 5. Section 6 concludes this paper. 

2 Background 

In this section, we describe three types of timed cosimulation con- 
sidered in this paper: uni-processor synchronous cosimulation, hy- 
brid cosimulation, optimistic distributed cosimulation. In the fol- 
lowing, we define a message to be a timestamped event. 

2.1 Uni-processor Synchronous Cosimulation 

In Figure 1, we assume that software (SW) and hardware (HW) 
start to run concurrently at time 0. The exact time when HW sends 
an interrupt to SW is not known a priori but given as a time interval.
In Figure 1, blank rectangles and numbers on them represent
simulation workloads and the corresponding local times in the sim- 
ulator they are running on, respectively. Blank arrows represent 
null messages for simulator synchronization only and shaded ar- 
rows represent interrupts from HW to SW. Shaded rectangles rep- 
resent simulator synchronization overhead in cosimulation run- 
time. In synchronous cosimulation as shown in Figure 1 (a), SW 
Figure 1: Reduction of synchronization overhead by hybrid cosimulation.



and HW simulators synchronize with each other at every system 
clock tick to detect the occurrence of interrupt. In synchronous 
cosimulation, most of the simulation run-time can be consumed in
simulator synchronization [1].

2.2 Hybrid Cosimulation

Most simulators do not support optimistic simulation features such 
as checkpoint (or state saving) and rollback. The same is true for 
emulators which can hardly support a cycle-accurate state saving 
feature. Therefore we consider a hybrid cosimulation where optimistic
simulator(s) and synchronous simulator(s) (including emulators)
co-exist. A representative case of the application of hybrid
cosimulation is accessing a high performance HW emulator via the
Internet.

In hybrid cosimulation [1] as shown in Figure 1 (b), to reduce
simulator synchronization overhead, we first run the optimistic
simulator for a time window of predetermined size W. In this example,
we assume that the SW simulator is an optimistic simulator. The
optimistic simulator stops after the time window W elapses or at a
time point W' (< W) when a message is sent to the synchronous
simulator (in this example, the HW simulator), and waits for messages
from the synchronous simulator. During the simulation, states of
optimistic simulation are stored at checkpoints in preparation for
the rollback. The synchronous simulator then runs until the time
point when the optimistic simulation stopped. The synchronous
simulation may stop earlier if the synchronous simulator sends a
message to the optimistic simulator. In this case, since the
timestamp of the message sent to the optimistic simulator is earlier
than the time point when the optimistic simulator has stopped, the
optimistic simulator rolls back to a checkpoint before (or equal to)
the timestamp of the message. If there is no message from the
synchronous simulator to the optimistic simulator, then the
synchronous simulator stops at the time point W (or W'). After
determining a new W, the optimistic simulator starts to run until W.
Then, the cosimulation continues in this way. Note that in hybrid
cosimulation a simulator stops its simulation when it sends a message
to another simulator or after the time window W or W' elapses.

2.3 Optimistic Distributed Cosimulation

In optimistic distributed cosimulation, a set of logical processes
(physically, optimistic simulators) execute concurrently and com- 
municate by exchanging messages. A logical process (LP), as a 
unit of parallel simulation, consists of (1) the simulation model of 
the sub-system being simulated, (2) a state queue to store the states 
of the simulation model, (3) an input message queue for messages
which arrive at the LP, and (4) an output message queue for messages
which the LP sends to other LP's. Each LP has its own local time
called local virtual time (LVT). Each LP works as follows. After
advancing LVT, the LP looks up the input message queue to find an
input message having a timestamp equal to LVT, processes the message,
and advances its LVT. If there is any unprocessed input message which
has a timestamp earlier than LVT (we call such a message a straggler
message), the LP rolls back its LVT according to the timestamp of the
straggler message, i.e. the state stored at the time point earlier
than or equal to the timestamp of the straggler message is restored.
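
As an illustration, the LP behaviour described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the simulation library used in the experiments: the class name `LogicalProcess`, the method names, and the use of bare integer timestamps as messages are all hypothetical, and a checkpoint taken at time t is assumed to hold the state before any message with timestamp t or later is processed.

```python
import bisect


class LogicalProcess:
    """Toy LP whose model state is the ordered list of processed timestamps."""

    def __init__(self):
        self.lvt = 0                   # local virtual time
        self.log = []                  # hypothetical model state
        self.inputs = []               # input message queue, sorted by timestamp
        self.checkpoints = [(0, [])]   # state queue: (time, saved state)
        self.rollbacks = 0

    def run_until(self, t):
        """Optimistically process queued inputs with lvt <= timestamp < t."""
        for ts in self.inputs:
            if self.lvt <= ts < t:
                self.log.append(ts)
        self.lvt = t
        self.checkpoints.append((t, list(self.log)))  # save state at time t

    def receive(self, ts):
        """Enqueue a message; roll back if it is a straggler (ts < LVT)."""
        bisect.insort(self.inputs, ts)
        if ts < self.lvt:
            self._rollback(ts)

    def _rollback(self, ts):
        # Discard checkpoints later than the straggler, then restore the
        # latest state stored at a time earlier than or equal to ts.
        while self.checkpoints[-1][0] > ts:
            self.checkpoints.pop()
        self.lvt, saved = self.checkpoints[-1]
        self.log = list(saved)
        self.rollbacks += 1


lp = LogicalProcess()
lp.receive(5)
lp.run_until(10)
lp.run_until(20)
lp.receive(12)      # straggler: LVT is already 20
lp.run_until(20)    # re-executes from the restored checkpoint at time 10
```

After the straggler arrives, the LP restores the checkpoint at time 10 and re-executes, so the message with timestamp 12 is processed in timestamp order.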

To support rollback, states are stored at checkpoints. To con- 
strain the memory usage of simulation host for state saving, a global 
virtual time (GVT) is calculated. GVT is the minimum of timestamps
of in-transit messages 1 and local virtual times of all LP's. States
and messages having timestamps earlier than GVT can be removed from
the state queue and the input/output message queues. 3 For more
details on optimistic distributed cosimulation, refer to [5]. A
representative case of applying optimistic distributed cosimulation
is accessing the simulation models of IP blocks and performing their
simulations via the Internet.
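
The GVT rule and the removal of old states can be illustrated with a short Python sketch. The function names `compute_gvt` and `fossil_collect` and the sample numbers are assumptions made for illustration only; the pruning rule keeps the latest state at or before GVT, following the footnote on states stored at GVT.

```python
def compute_gvt(local_virtual_times, in_transit_timestamps):
    """GVT is the minimum over all LVTs and all in-transit message timestamps."""
    return min(list(local_virtual_times) + list(in_transit_timestamps))


def fossil_collect(checkpoints, gvt):
    """Prune a state queue given as (time, state) pairs sorted by time.

    States later than GVT are kept. Of the states at or before GVT, only
    the latest one is kept, so that a rollback to GVT stays possible.
    """
    at_or_before = [c for c in checkpoints if c[0] <= gvt]
    after = [c for c in checkpoints if c[0] > gvt]
    return at_or_before[-1:] + after


gvt = compute_gvt([30, 45], [22, 60])
kept = fossil_collect([(0, "a"), (10, "b"), (20, "c"), (30, "d")], gvt)
```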

3 Motivation 

Grouping multiple messages into a smaller number of physical
messages gives faster transmission than transmitting each message
separately, since the communication overhead over the Internet does
not strictly depend on the sizes of the messages being transferred,
but rather depends strongly on the number of physical messages
transferred.
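
A rough way to see this effect is a latency-plus-bandwidth cost model, sketched below in Python. The model itself, the function name `transfer_time`, and the numbers (5 ms per physical message, 1 MB/s, 256 messages of 44 bytes) are hypothetical values chosen only for illustration, not measurements.

```python
def transfer_time(num_physical_messages, total_bytes,
                  latency_s=0.005, bandwidth_bytes_per_s=1_000_000):
    """Assumed model: a fixed cost per physical message plus a size-dependent cost."""
    return num_physical_messages * latency_s + total_bytes / bandwidth_bytes_per_s


total_bytes = 256 * 44                      # same payload in both cases
separate = transfer_time(256, total_bytes)  # one physical message per message
grouped = transfer_time(1, total_bytes)     # all messages in one physical message
```

Under these assumed numbers the per-message term dominates, so sending the same bytes as one physical message is far cheaper than sending 256 separate ones.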

Grouping messages also has the advantage of reducing the num- 
ber of rollbacks. Figure 2 illustrates an example of communication 
of messages between a SW simulator and a HW simulator in optimistic
distributed timed cosimulation. In Figure 2, we assume that SW
receives 64 data from HW. In SW processors, such a communication can
be performed by executing memory load instructions (e.g. LDR or LDM
instructions in ARM7 processor [6]). To receive each of the data, SW
sends the address value to HW (event
on the address bus). HW sends the datum corresponding to the re- 
ceived address value to SW (event on the data bus). In memory 
load (or store) instructions, the time gap between the event on the 
address bus and the event on the data bus is within a few clock 
cycles in the simulated time. 

However, due to high communication overhead (e.g. at least a few
milliseconds per message transfer) in geographically distributed
cosimulation environments, when the datum requested by SW arrives at
the SW simulator, the SW simulator (which can simulate millions of
cycles per second on high performance workstations) may have
proceeded further into the future in the simulated time. Such a
straggler message causes rollback in the receiving optimistic
simulator, in this example, the SW simulator. Figure 2 also
illustrates rollbacks (upward arcs) caused by such straggler
messages. As shown in Figure 2 (a), optimistic distributed timed
cosimulation suffers from excessive rollbacks when intensive
synchronization between simulators is performed in a short period of
simulated time.

To reduce such excessive rollbacks, we use a hierarchically 
grouped message (HM) concept. Figure 2 (b) shows simulator syn- 
chronization using HM's. In this example, we group 64 messages 
transferring from SW to HW (HW to SW) into a single physical 
message HM2hw (HM2sw). In constructing a new physical mes- 
sage, we neither merge original events into a new event nor increase 

1 Messages which are in the communication channels between LP's, or not
processed yet in input message queues. In our implementation, the Internet
communication channel works as a FIFO queue.

3 If there is no state stored at GVT, the state having the latest timestamp (but earlier 
than GVT) is kept in the state queue. 



Figure 2: Reduction of rollbacks by hierarchically grouped messages.
(Panel (b): SW and HW exchange the grouped physical messages HM2hw
and HM2sw.)
Figure 3: An example of hierarchically grouped message.



the abstraction levels of messages, i.e. original events (and their
abstraction levels) are kept unchanged in our method. Since only a
single physical message is sent from SW (HW) to HW (SW) in 
Figure 2 (b), rollback occurs only twice in total. 

Such reduction of excessive rollbacks, however, does not come 
for free. Since the transfer of messages is delayed for the con- 
struction of the whole HM, rollback distance (the amount of simu- 
lated time which is canceled by rollback) on each simulator may in- 
crease. In Section 5, however, we show experimentally that such a 
negative effect is negligible. Basically, since optimistic simulation 
is performed, the construction of HM's is possible. It is because the 
causality error caused by the delay of message transfer during the 
construction of HM's can be recovered by the rollback mechanism. 

4 Hierarchically Grouped Messages 
4.1 Specification of HM 

For the explanation of specifying HM's, Figure 3 illustrates the 
construction of an HM for transferring 64 data from SW to HW. 
First, each message represents an event (or simultaneous events) 
on the address/data buses or control signals such as we_b (write 
enable bar). The transfer of each datum is specified as a group of
messages as shown in Figure 3. The transfer of 64 data is specified 
as a group of groups of messages, each group of which represents 
the transfer of a datum. As such, higher level groups of messages 
are constructed by grouping lower level messages (or groups of 
messages) in a hierarchical way. 

Each HM has an (or a set of) address range(s) associated with 
the data belonging to the HM. In Figure 3, the HM transferring 64 
data has an address range from 0x80 to 0xbc. The designer can
also specify an address range to construct an HM for the purpose 
of performance optimization. 



4.2 Construction of HM during Simulation 

During cosimulation, each simulator monitors the values on the ad- 
dress bus and starts to construct an HM by detecting the start ad- 
dress value (e.g. 0x80 in Figure 3) of the address range of the 
HM. During the construction of the HM, output messages are not 
sent to their receiving simulator. Instead, they are stored in the out- 
put message queue. If the simulator detects the end address of the 
address range (e.g. 0xbc in Figure 3), then the simulator creates a
physical message with the unsent messages in the output message 
queue and sends it to the receiving simulator. We refer to the time 
period between the start time and the end time of the construction 
of an HM as an HM construction period. 
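
The construction procedure above can be sketched in Python as follows. This is a minimal sketch, assuming a word-addressed range with a stride of 4 and a hypothetical `send` callback that transmits one physical message (a list of messages); the class `HMBuilder` and its method names are illustrative and are not taken from the implementation.

```python
class HMBuilder:
    """Buffers output messages between the start and end address of an HM."""

    def __init__(self, start_addr, end_addr, send):
        self.start_addr = start_addr
        self.end_addr = end_addr
        self.send = send        # hypothetical callback: sends one physical message
        self.building = False
        self.pending = []       # unsent messages held in the output message queue

    def on_output(self, msg, addr=None):
        """Handle one output message; addr is the address-bus value, if any."""
        if not self.building and addr == self.start_addr:
            self.building = True            # start address detected
        if not self.building:
            self.send([msg])                # outside an HM: send separately
            return
        self.pending.append(msg)            # inside an HM: hold the message
        if addr == self.end_addr:
            self.flush()                    # end address detected

    def flush(self):
        """Send the buffered messages as one physical message."""
        if self.pending:
            self.send(list(self.pending))
        self.pending.clear()
        self.building = False


sent = []
builder = HMBuilder(0x80, 0xBC, sent.append)
builder.on_output("before", addr=0x10)
for address in range(0x80, 0xC0, 4):
    builder.on_output(("addr", address), addr=address)
builder.on_output("after")
```

Calling `flush` before the end address is reached would correspond to sending a partial HM.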

From the implementation viewpoint, an HM is an array of
messages. From the viewpoint of the receiving simulator which 
reads each incoming message one by one from the Internet com- 
munication channel, there is no difference between hierarchically 
grouped messages and separately sent messages. The construction 
of an HM requires proper modifications in the cases that (1) in- 
terrupt is allowed during the construction of an HM, (2) an HM 
is constructed during the data dependent execution, and (3) a syn- 
chronous simulator in hybrid cosimulation constructs an HM. 

4.3 Handling interrupts 

Depending on whether interrupt is allowed during communication 
between SW and HW, we classify the hierarchically grouped mes- 
sage into two types: interruptible HM and non-interruptible HM. 
We define an interruptible HM as follows. 

Definition 1 If the execution of SW can be interrupted while SW
is constructing (or processing messages belonging to) an HM, the 
HM is defined to be an interruptible HM. 

For example, while SW reads 64 data from HW, the execution 
of SW can be interrupted by a timer interrupt to the SW processor
unless the interrupt is masked. For the non-interruptible HM, the
execution of SW is guaranteed to continue during the construction
or reception of the HM. For the interruptible HM, the simulator
sends a partial HM in the cases described below. By a partial HM, 
we mean an HM which has been constructed until some time point 
before the end address of the HM is reached. 

For the interruptible HM, the simulator sends a partial HM in 
the following two cases. 

Case 1 While the HW simulator is constructing an interruptible 
HM, HW sends an interrupt to SW. 

In this case, since SW execution will be interrupted by the in- 
terrupt sent by HW, the transfer of the interruptible HM is not guar- 
anteed to continue. Thus, the HW simulator stops constructing the 
HM and sends the partial HM to the SW simulator. 

Case 2 During the HM construction period of an interruptible HM, 
the SW simulator processes a message containing an interrupt event. 

In this case, since SW execution is interrupted by the interrupt 
event, the SW simulator sends the partial HM to the HW simulator. 

4.4 Sending a Partial HM in Data Dependent Execution 

To avoid large delay caused by data dependent executions such as
data dependent loops during the construction of an HM, the simulator
sends the partial HM if the delay exceeds a given timeout value
($T_{timeout}$). That is, if $LVT - T_{HM\_start} > T_{timeout}$, then
the simulator sends the partial HM. $T_{HM\_start}$ represents the local
virtual time when the simulator starts to construct the HM. The
designer can set the timeout value $T_{timeout}$. If $T_{timeout} = 0$,
then the HM concept is not used.
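
The timeout test can be written as a one-line predicate. The Python sketch below uses hypothetical names (`should_send_partial_hm` and its parameters) and sample values; it only restates the condition that the time elapsed since the start of HM construction exceeds the timeout.

```python
def should_send_partial_hm(lvt, t_hm_start, t_timeout):
    """True when LVT minus the HM start time exceeds the timeout value."""
    return lvt - t_hm_start > t_timeout


late = should_send_partial_hm(lvt=150, t_hm_start=100, t_timeout=40)
on_time = should_send_partial_hm(lvt=130, t_hm_start=100, t_timeout=40)
no_grouping = should_send_partial_hm(lvt=101, t_hm_start=100, t_timeout=0)
```

With a timeout of 0, the predicate holds as soon as LVT advances past the start time, so messages are effectively not grouped.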
Table 1: Cosimulation run-times for the H.263 decoder.

Figure 4: Optimistic distributed cosimulation using ARM7 ISS and
Synopsys Cyclone.



4.5 Construction of HM in Hybrid Cosimulation 

As explained in Section 2, in hybrid cosimulation the simulator
stops its simulation when it sends a message to another simulator or
after the time window W or W' elapses. However, in applying the HM
concept to hybrid cosimulation, the simulator does not stop its
simulation during the construction of an HM. Therefore, it may
continue the simulation beyond W or W'. After the construction of the
HM, the simulator sends the HM to another simulator, stops its
simulation, and waits for messages from the other simulator.

Basically, since HM concept is applied to optimistic simula- 
tion, only the optimistic simulator can construct HM's. For the 
non-interruptible HM, however, the synchronous simulator can also 
construct HM's in hybrid cosimulation since SW execution is guar- 
anteed to be continued during the construction of the non-interruptible 
HM. 

4.6 Calculation of GVT during the Construction of HM 

In optimistic distributed cosimulation, when a simulator calculates
GVT, it sends a request to the other simulators to obtain information
for calculating GVT. When a simulator acknowledges the request, it
sends to the requesting simulator the minimum value of its LVT and
the timestamps of unprocessed messages in its input message queue.
If the acknowledging simulator is constructing an HM, it instead
sends the minimum value of its LVT, the timestamps of unprocessed
input messages, and the timestamps of unsent output messages.
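
The value a simulator reports for the GVT calculation can be sketched as below. The function name `gvt_response` and its arguments are hypothetical. One plausible reading of why unsent output messages are included is that messages held back during HM construction are not yet in transit, so their timestamps would otherwise be invisible to the GVT calculation; this reading is an assumption, not a statement from the text.

```python
def gvt_response(lvt, unprocessed_input_ts, unsent_output_ts, constructing_hm):
    """Minimum timestamp a simulator reports when asked for GVT information."""
    candidates = [lvt] + list(unprocessed_input_ts)
    if constructing_hm:
        # Messages buffered for the HM have not been sent yet.
        candidates += list(unsent_output_ts)
    return min(candidates)


while_building = gvt_response(120, [130], [95, 100], constructing_hm=True)
otherwise = gvt_response(120, [130], [95, 100], constructing_hm=False)
```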

5 Experiments 

We performed geographically distributed timed cosimulation for two
examples: an H.263 decoder [7] and a JPEG encoder [8]. For the H.263
decoder, 3 frames of an image called Carphone (QCIF: 176x144 pixels)
are decoded and for the JPEG encoder, a 116x96 image is encoded. For
the HW parts of the examples, Discrete Cosine Transformation (DCT)
and Inverse DCT functions are implemented. The other parts of the
examples are implemented in SW.

We construct HM's for transferring 64 data from SW (HW) to 
HW (SW). In our implementation, a single original message has 
44 bytes information. For transferring one datum from SW (HW) 
to HW (SW), four messages are transferred from the SW simulator 
to the HW simulator. In the case of transferring one datum from 
HW to SW, a single message is transferred from the HW simulator 
to the SW simulator together with four messages transferred from 
the SW simulator to the HW simulator. Thus, an HM from SW
to HW contains 11,264 (=44x4x64) bytes and an HM from HW to
SW contains 2,816 (=44x64) bytes in total.
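
The HM sizes quoted above follow directly from the message counts; the short Python check below reproduces the arithmetic. The constant names are chosen here for illustration.

```python
BYTES_PER_MESSAGE = 44        # size of a single original message
DATA_PER_HM = 64              # data items grouped into one HM
MESSAGES_PER_DATUM_TO_HW = 4  # messages from the SW simulator per datum
MESSAGES_PER_DATUM_TO_SW = 1  # messages from the HW simulator per datum

hm_sw_to_hw_bytes = BYTES_PER_MESSAGE * MESSAGES_PER_DATUM_TO_HW * DATA_PER_HM
hm_hw_to_sw_bytes = BYTES_PER_MESSAGE * MESSAGES_PER_DATUM_TO_SW * DATA_PER_HM
```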

We use an ARM7 instruction set simulator (ISS) having optimistic
simulation features for SW simulation [9]. For optimistic HW
simulation, we use a commercial cycle-based simulator, Synopsys
Cyclone, utilizing its checkpoint and restore functions [10].
Optimistic simulation library functions [5] are linked with ARM7 ISS
and Cyclone, respectively. We also use a HW emulator [11] (based on
Xilinx XC4085), which does not provide optimistic simulation
features.

Table 2: Cosimulation run-times for the JPEG encoder.

We use the number of hops $N_{hop}$ to denote the number of Internet
connections in a geographically distributed cosimulation
environment. 3 To examine the effect of communication overhead via
the Internet, we performed cosimulation in two different
geographically distributed cosimulation environments ($N_{hop}$ = 3 or
12). In particular, the case of $N_{hop}$ = 12 is a connection between a
workstation (or a PC) at Seoul Nat'l Univ. in Korea and a
workstation at Virginia Tech. in the U.S.

For optimistic distributed cosimulation, we run ARM7 ISS and 
Cyclone on two remotely located simulation hosts (two SUN
UltraSparc I's, 143 MHz). Figure 4 shows a simplified view of our
optimistic distributed cosimulation. For the case of Synopsys Cy- 
clone, we use C Language Interface (CLI) to link our optimistic 
simulation library functions with Cyclone. We also use a wrapper 
(a Unix process) to issue simulation commands (run, checkpoint, 
and restore as shown in Figure 4) to Cyclone. For hybrid cosimu- 
lation, we run ARM7 ISS (i.e. the optimistic simulator) on a work- 
station and the HW emulator (i.e. the synchronous simulator) on a 
PC (Pentium II, 300 MHz, Win98). 

First, we ran uni-processor synchronous cosimulation of the two
examples using ARM7 ISS and Cyclone on an UltraSparc I workstation
and obtained 5,816 sec (for H.263) and 1,418 sec (for JPEG) for the
run-times. Tables 1 and 2 show cosimulation run-times of the two
geographically distributed cosimulation environments. Compared to
the run-times of uni-processor synchronous cosimulation, the
performance improvement of optimistic distributed cosimulation (w/o
HM) comes mainly from the reduction of simulator synchronization
overhead rather than the benefit from parallel simulation. Applying
the HM concept to optimistic distributed cosimulation, we obtain 1.53
and 1.40 times (1.20 and 1.44 times) performance improvement for the
H.263 example (for the JPEG example) in the two cases of $N_{hop}$.

Table 3 shows the reduction of the numbers of rollbacks by applying
the HM concept to optimistic distributed cosimulation. Figure 5 shows
the histograms of the numbers of rollbacks in the case of optimistic
distributed cosimulation of the H.263 example (when $N_{hop}$ = 12).
In Figure 5, the number of short rollbacks is dramatically reduced by
applying the HM concept, while that of long rollbacks slightly
increases due to the delay of message transfer caused by the
construction of HM.

3 In this paper, $N_{hop}$ is defined to be the number of Internet
routers (including gateways) plus one.

Table 3: The numbers of rollbacks in optimistic distributed
cosimulation.

Table 4: The numbers of physical messages (No. MSG) and rollbacks
(No. RB) in hybrid cosimulation.

Figure 5: Histograms on rollback statistics (# of rollbacks vs.
rollback distance) in the H.263 decoder example.

In Tables 1 and 2, compared to the run-times of uni-processor
synchronous cosimulation, the performance improvement of hybrid
cosimulation (w/o HM) is mainly from the reduction of HW simulation
run-time by using a HW emulator instead of the cycle-based simulator.
As shown in Table 4, by applying the HM concept to hybrid
cosimulation, the numbers of physical messages and rollbacks are
reduced to 0.78% and 4.22% of their original values for the H.263
example, respectively (0.84% and 4.87% for the JPEG example). Such
reduction of the numbers of physical messages and rollbacks gives
11.28 times (for H.263) and 5.61 times (for JPEG) performance
improvement (when $N_{hop}$ = 3) in Tables 1 and 2. In hybrid
cosimulation [1], since the increase of communication overhead does
not change rollback behavior, Table 4 gives a single number for each
type of cosimulation.

In Tables 1 and 2, as the communication overhead represented by
$N_{hop}$ increases, the run-time of hybrid cosimulation without the
HM concept increases steeply due to the large numbers of physical
messages and rollbacks, while the HM concept gives a much slower
increase of run-time.

In Tables 1 and 2, the HM concept gives better performance
improvement in hybrid cosimulation than in optimistic distributed
cosimulation. The reason is as follows. In hybrid cosimulation
without the HM concept, two simulators synchronize at least at every
message transfer as described in Section 2. For a simulator to
start, it must wait to receive a message (a null message or a message
having an event) from the other simulator(s). On the contrary, in
optimistic distributed cosimulation, simulators do not stop to wait
for messages. Thus, the reduction of the number of physical messages
in the case of hybrid cosimulation has a stronger effect on the
reduction of the number of simulator synchronizations, i.e. the
reduction of simulator synchronization overhead including rollback
overhead.

6 Conclusion 

In this paper, we present the hierarchically grouped message (HM)
concept to reduce the simulator synchronization overhead in
geographically distributed timed cosimulation. We obtained
significant performance improvement by applying the HM concept to
geographically distributed cosimulation environments. Our experiments
show that the HM concept enables geographically distributed timed
cosimulation to be applied in practical situations.

Currently, we are integrating hybrid and optimistic distributed
cosimulation together with the HM concept into an existing system
design framework. Our future work includes developing efficient
synchronization methods in hybrid distributed cosimulation
environments where software simulators, hardware simulators, and
analog simulators co-exist.

Abstract 

Behavioral simulation with timing annotations derived 
from performance modeling and analysis is a promising 
alternative for use in evaluating system-level design tradeoffs 
[1, 2]. The accuracy of such approaches is determined 
by how well the effects of various HW and SW architectural 
features, like the Real Time Operating System (RTOS), 
shared memories and buses, HW/SW communication mechanisms, 
etc., are modeled at this level. 
We present a study of the effects of shared memory buses 
during system-level performance analysis in the POLIS co- 
design environment, using the example of a TCP/IP Net- 
work Interface System. We demonstrate how the effects of 
the memory arbiter and shared memory bus can be mod- 
eled efficiently at the behavioral level, and used to evaluate 
various design tradeoffs. Experimental results demonstrate 
that modeling these effects can significantly increase the 
accuracy of system-level performance estimates. 



1 Introduction 

Efficient exploration of system-level design tradeoffs depends 
heavily on the availability of fast and accurate estimation 
and modeling techniques, for metrics such as performance, 
power, and cost, to guide various design decisions. 
Various techniques have been proposed for performance 
analysis of hardware [3, 4, 5] and software [6, 7]. In this pa- 
per, we focus on performance modeling for mixed HW/SW 
embedded systems. Hardware-software co-simulation [8] 
remains the most popular approach to performance estimation 
for such systems. There are several flavors of hardware-software 
simulation, with varying degrees of efficiency and accuracy. 

*This work was started when the authors were at NEC C&C Research 
Labs, Princeton, NJ. 

The techniques that involve simulating (RTL) 
hardware models of the embedded processor(s) along with 
the models of the hardware components tend to be the most 
accurate, but are also the slowest. Moreover, detailed hard- 
ware models for embedded processors are often not available 
to system designers. A popular alternative is to use instruc- 
tion set simulators (ISS) to simulate the software compo- 
nents of the system, and HDL simulators to simulate the 
hardware components. Instruction set simulators may be cycle- 
and bit-accurate, or may abstract out some architectural 
details of the target embedded processor, such as pipelines 
and superscalar ordering, for efficiency. The efficiency of 
this approach may still be limited due to the (assembly or 
binary instruction) level of detail in software simulation, and 
the communication overhead required to synchronize the ex- 
ecution of the ISS and hardware simulator. While there has 
been some work on attempting to reduce the synchroniza- 
tion overhead [9, 10], such approaches are still not very 
efficient for use in exploring tradeoffs during HW/SW co-design. 
Bus functional models of the embedded processors 
may be used to exercise the hardware components without 
needing to run an ISS concurrently. However, only the hardware 
functionality is simulated in this approach, making it 
more suitable for validation of the hardware and HW/SW interface. 
Using an interface-based design methodology [11] 
helps separate the behavior of the components from their 
interface protocols, and allows the use of time and space 
abstractions for efficient validation and analysis. 

Behavioral simulation coupled with timing annotations 
based on performance modeling techniques offers a promis- 
ing alternative for use in evaluating system-level design 
tradeoffs [12, 2]. In such approaches, behavioral models 
of the software components are simulated, and performance 
estimates for blocks of code are used to annotate timing information. 
In the POLIS co-design environment [12], a homogeneous 
behavioral representation is used for hardware 
as well as software components. The behavioral simulation, 
analysis, and evaluation is performed using the PTOLEMY 
heterogeneous simulation environment [13]. Timing infor- 
mation for software modules during simulation is main- 
tained based on performance estimates derived using the 
technique presented in [1]. The accuracy of behavioral sim- 
ulation based approaches is determined by how well the 
effects of various HW and SW architectural features, like 
the Real Time Operating System (RTOS), shared memories 
and buses, HW/SW communication mechanisms, etc., 
are modeled at this level. For example, the effects of the 
RTOS are modeled in POLIS during performance analysis, 
and the user can select between several scheduling policies 
(e.g. round-robin, static priority based, etc.) and evaluate 
their impact on the system performance. 

In this paper, we focus on modeling the effects of shared 
memory buses during system-level performance analysis, 
using the POLIS co-design environment. The performance 
of several designs, including graphics and telecommunica- 
tions applications, may be dominated by memory accesses, 
making it important to accurately model memory-related ef- 
fects during system-level design exploration. Using the ex- 
ample of a TCP/IP Network Interface System, we illustrate 
how the effects of the memory arbiter and shared memory 
bus can be modeled efficiently at the behavioral level, and 
used to evaluate various design tradeoffs. Experimental re- 
sults are presented to indicate that ignoring the effects of the 
shared memory access bus would have led to significantly 
incorrect performance estimates, and possibly incorrect de- 
sign decisions. 

The paper is organized as follows. Section 2 provides 
some background about the TCP/IP Network Interface Sys- 
tem used for our study, and the modelling of the system in 
the POLIS co-design environment. Section 3 presents the 
results of the evaluation of the effects of the shared memory 
bus on several design tradeoffs, and section 4 concludes the 
paper and discusses future work. 

2 The TCP/IP System Model 

This section provides some background relating to the 
TCP/IP system, and presents the model used for the system 
in the POLIS environment. 

2.1 Background 

A TCP packet consists of three parts: 

• An IP header containing, among other fields, the source 
and destination IP addresses. The IP header is usually, 
but not always, 20 bytes long. 

• A TCP header, containing TCP-specific information. 
This is usually another 20 bytes. 

• The payload, a variable number of bytes (possibly odd) 
up to a maximum of 65535 bytes. 

The TCP/IP protocol requires various tasks to be performed 
on incoming and outgoing packets, and to maintain the sys- 
tem state. We focus on the evaluation of a dedicated hard- 
ware implementation for one of the tasks that is part of the 
TCP layer - checksum computation. The factors that make 
this task a good candidate for hardware implementation are 
explained later. 

The IP header is protected by its own 16-bit checksum, 
which is computed in the IP layer. Since this is computed over 
such a small number of bytes, it is (relatively) cheap even 
in software. The TCP data has a 16-bit checksum, carried 
in the TCP header. It is computed over: 

• The 8 bytes of IP addresses and 16 bits of length field in 
the IP header, 

• The TCP header excluding the 16-bit checksum, 

• The payload, taken 16 bits at a time, padding the last 
byte with a zero byte if required. 

The checksum treats the bytes in pairs, taking each pair 
of bytes as a 16-bit integer in big-endian byte order. Each 
16-bit number is added into the temporary result using unsigned 
32-bit integer arithmetic. To obtain the final checksum, 
the most significant 16 bits of the temporary result 
are added to the least significant 16 bits, and the result is 
XOR'ed with 0xffff. 
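
The computation described above can be sketched in a few lines of Python. This is only an illustration and not the code used in the study: the function name tcp_checksum is hypothetical, and the fold of the upper 16 bits into the lower 16 bits is repeated here (an assumption) so that a carry produced by the first fold is not lost.

```python
def tcp_checksum(data: bytes) -> int:
    """16-bit checksum over big-endian byte pairs (illustrative sketch)."""
    if len(data) % 2:                  # odd length: pad with one zero byte
        data += b"\x00"
    total = 0
    for k in range(0, len(data), 2):
        word = (data[k] << 8) | data[k + 1]    # big-endian 16-bit value
        total = (total + word) & 0xFFFFFFFF    # unsigned 32-bit accumulation
    while total >> 16:                         # add the high 16 bits to the low 16 bits
        total = (total >> 16) + (total & 0xFFFF)
    return total ^ 0xFFFF                      # final XOR with 0xffff
```

A receiver can run the same function over the data followed by the transmitted checksum; a result of zero indicates that the two agree.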

The checksum computation is particularly inefficient on 
little-endian processors because the big-endian 16-bit numbers 
have to be generated by shift-or logic. Also, it is 
basically a repetitive operation over potentially large volumes 
of data and contains several bit-level operations. The 
above factors make the checksum computation a good can- 
didate for hardware implementation. We attempted to model 
parts of the TCP/IP system relating to the checksum com- 
putation using POLIS with the motivations of quantitatively 
evaluating (i) the performance improvement obtained by implementing 
the checksum computation in HW, and (ii) the 
possible adverse effects of SW and HW processes contending 
for access to the shared packet memory. However, we 
believe that the effects of shared memory access on system-level 
performance evaluation that we present are applicable 
to any HW/SW system, and not limited to the design example 
or HW/SW configuration used for this study. 

Figure 1. The modeled TCP/IP sub-system 

2.2 Modeling the TCP/IP subsystem in POLIS 

Figure 1 shows the sub-system that has been described 
in POLIS for our case study. The system was modeled 
as ten interconnecting CFSMs, each specified in ESTEREL, 
and their interconnection was described graphically with the 
Ptolemy user interface. 

For incoming packets, the module create_pack receives 
a packet from the lower layer (in this case, the IP 
layer), and stores it in the shared memory. When it finishes, 
it sends the starting address of the 
packet in memory, the number of bytes and the checksum 
header to a queue (packet queue). From this queue, 
the module ip_check retrieves a new packet, overwrites 
parts of the checksum header (which should not be used 
in the checksum computation) with 0s, and signals to the 
checksum process that a new packet can be checked for 
checksum consistency. The checksum process performs 
the core part of the checksum computation, accessing the 
packet in memory through the arbiter and accumulating the 
checksum for the packet body. When it is done, it sends the 
computed 16-bit checksum back to the ip_check process, 
which then compares the computed checksum with the 
incoming transmitted checksum, and flags an error if they do 
not match. The flow for outgoing packets is similar, but in 
the reverse direction, and there is no need for comparison of 
the final checksum. 

2.3 Behavioral Model of the Memory Bus and 
Arbiter 

In the original behavioral description that was used to 
validate the functionality of the processes, memory accesses 
were modeled by access to a global array, using a C func- 
tion call from Esterel, i.e. the module arbiter shown in 
Figure 1 was not present. However, as we show in Sec- 
tion 3, using the same model for performance evaluation 
suffers from the drawback of ignoring effects such as shared 
memory access conflicts, block access mode (DMA), etc. 
Hence, we described a behavioral model of the shared bus 
and memory arbiter (shown as module arbiter in Fig- 
ure 1) to model the effect of the controller (arbiter) of the 
shared memory bus. The arbiter module is the only 
module that can access the shared memory: it receives re- 
quests from the processes create_pack, ip_check and 
checksum, and is responsible for deciding which module 
is given access to the memory. The functional model of the 
arbiter is such that the access priority scheme can be easily 
changed or parametrized. For example, we may specify 
that in the case of simultaneous requests, the arbiter should 
give higher priority to checksum and lower priority to 
create_pack. 

In our system, the primary functions of the arbiter are: 
(i) to avoid multiple components simultaneously driving the 
bus in an attempt to access memory using a simple request- 
grant protocol, (ii) to resolve simultaneous access attempts 
based on priorities that can be specified by the designer, 
(iii) to allow components to request dedicated access of 
the memory bus for a certain number of bus cycles (block 
access mode or DMA mode). We have created a behavioral 
model of the arbiter and shared memory bus in Esterel that 
is called arbiter in Figure 1. The arbiter process has 
a dedicated interface to each of the processes that need 
to access memory; this interface can be similar to, or an abstraction 
of, the shared memory bus interface. In addition, each 
process that accesses memory is enhanced to include an 
arbiter interface. For example, the signals that interface 
the arbiter process to the checksum process are shown 
in Figure 2. The interface consists of a memory access 
request signal req_chk, on which the checksum process 
generates an event to indicate that it would like to access 
memory. The starting address is placed on signal addr_chk, 
and a block size signal nword_chk is used (in DMA or 
block access mode) to convey the number of bus cycles of 
dedicated bus access requested. The arbiter generates an 
event on the signal grant_chk to indicate that the request 
has been granted. In addition, there are data in, data out, 
and read/write signals to the memory. 

Figure 2. The interface of the arbiter model 

A part of the Esterel specification of the arbiter process 
is shown in Figure 3. Signals req_create, req_ip, and 
req_chk represent the requests for access to the memory 
bus from the create_pack, ip_check, and checksum 
processes, respectively. Note that the behavior of the arbiter 
is described as an infinite loop which immediately encloses 
a set of nested if-then-else statements that test for 
the presence of events on the various memory access request 
signals. The code within this set of if-then-else 
statements represents the actions to be taken for processing 
a memory access request from the corresponding module. 
Figure 3 only shows the code for processing a memory access 
request from the checksum process; the parts for handling 
requests from other modules are similar. The priorities 
given by the arbiter to requests from the various processes 
are determined by the order in which the request signals are 
tested in the nested if-then-else statements. For example, 
the code shown in Figure 3 gives highest priority to 
requests from create_pack, since the signal req_create 
is tested for an event first. Thus, changing the memory access 
priorities of the processes can be achieved by simply 
re-ordering the testing of the access request signals in the 
behavioral arbiter model. 

    loop 
      if (?req_create=1) then 
        ... 
        % grant access to create_pack 
      elsif (?req_ip=1) then 
        ... 
        % grant access to ip_check 
      elsif (?req_chk=1) then 
        i := ?addr_chk; 
        emit grant_chk; 
        repeat ?nword_chk times 
          if (?rnw_chk=false) then 
            await din_chk;               % memory write 
            emit addr(i); emit din(?din_chk); 
            emit rnw(?false); 
          else 
            emit addr(i);                % memory read 
            emit din(?din_chk); 
            emit rnw(?true); 
            await din_mem; 
            emit dout_chk(?din_mem); 
          end if; 
          i := i+1; 
        end repeat; 
        emit res_chk; 
      end if; 
    end loop; 

Figure 3. Esterel model of the arbiter process 
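
The way priority follows from the order of the tests can be illustrated outside Esterel as well. The Python sketch below is only an analogy under stated assumptions: the function arbitrate and its arguments are hypothetical names, and it models a single arbitration decision rather than the full request-grant protocol.

```python
def arbitrate(requests, priority=("create_pack", "ip_check", "checksum")):
    """Return the process that wins this arbitration round, or None.

    requests: dict mapping a process name to True if it has a pending request.
    priority: names in the order they are tested; earlier means higher priority.
    """
    for name in priority:            # mirrors the nested if-then-else chain
        if requests.get(name):
            return name
    return None
```

Re-ordering the priority tuple plays the same role as re-ordering the tests in the behavioral model: with simultaneous requests from create_pack and checksum, the default order grants create_pack, while the order ("checksum", "ip_check", "create_pack") grants checksum.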

We would like to reiterate that the behavioral arbiter 
model shown above is not part of the system specification - 
it was added to model the effects of the shared memory bus 
and memory arbiter during behavioral level performance 
simulation. However, during the performance simulation, it 
is treated just like any other module. The implementation 
of the arbiter process is specified to be HW, because it 
allows us finer control of its timing properties. The number 
of memory access cycles, and processing time taken by the 
arbiter, can be easily modeled using await tick statements 
appropriately inserted in the behavioral model. 



3 Performance Simulation and Experimental 
Results 

In the POLIS environment, the system specification, 
which may consist of a PTOLEMY netlist that describes the 
interconnection of the functional components or modules 
and an Esterel specification that describes the functionality 
of each module, is translated into a network of co-design 
finite state machines (CFSMs), which are extended FSMs 
with asynchronous buffered communication. Performance 
analysis is carried out using the heterogeneous simulation 
environment offered by PTOLEMY [13]. The performance 
simulation is based on a C model of each CFSM that is 
automatically generated, using the hardware/software parti- 
tioning specified by the user, the scheduling policy for the 
RTOS specified by the user, and a timing model for the target 
processor that is derived during a characterization step [12]. 
We simulated the TCP/IP subsystem with network traffic 
that was captured using a profiling tool from an existing 
software implementation of the TCP/IP protocol. 

We performed several experiments to demonstrate the 
value added by our behavioral model of the arbiter and 
shared memory bus during system-level design, some of 
which we present here. 

Figure 4. Variation of computation times with 
DMA block size 

In the first experiment, we analyzed how the 
processing time of each module, as well 
as the complete per-packet processing time of the entire 
system, varies with the DMA block size used for 
memory access. For this experiment, the create_pack 
process was mapped to software running on a MIPS R3000 
processor, and the checksum and ip_check processes were 
mapped to hardware. Figure 4 shows the variation of (average) 
per-packet processing times for the three processes for 
a test bench consisting of three packets of length 512, 64, 
and 448 bytes, for block sizes of 4, 8, 32, and 64 bytes. The 
following conclusions can be drawn from Figure 4: 

• As expected, the processing times for all the modules as 
well as the total processing time decrease with increasing 
DMA block size, since the handshaking overhead 
required to obtain memory access is amortized over a 
larger number of data transfers. The decrease is significant 
at lower DMA block sizes. 

• In addition, the sensitivity of the performance of the 
software module (create_pack) to DMA block size 
is higher, since the time required for handshaking with 
the arbiter is much higher for the software module than 
for the hardware modules. 

Table 1. Processing times without memory conflicts 

packet #   create_pack   ip_check   checksum   total 
1          513           1101       1088       2702 
2          65            149        136        350 
3          449           965        952        2366 

Note that it would have been impossible to perform the 
above analysis in the absence of the behavioral model of the 
shared memory bus and arbiter, since the reported processing 
times would be constant for various values of block size. 
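
The amortization argument can be made concrete with a toy cost model. The sketch below is an assumption-laden illustration, not the POLIS timing model: the function transfer_cycles, the handshake cost of 10 or 50 cycles, and the one-cycle-per-byte transfer cost are all hypothetical values chosen only to show the trend.

```python
import math

def transfer_cycles(n_bytes, block_bytes, handshake_cycles, cycles_per_byte=1):
    """Cycles to move n_bytes when each DMA block needs one request-grant handshake."""
    bursts = math.ceil(n_bytes / block_bytes)          # number of handshakes needed
    return bursts * handshake_cycles + n_bytes * cycles_per_byte

# Hypothetical 512-byte packet with a 10-cycle handshake per block.
costs = [transfer_cycles(512, block, handshake_cycles=10) for block in (4, 8, 32, 64)]
```

Under these assumed numbers the cost falls from 1792 cycles at a block size of 4 to 592 cycles at a block size of 64, with most of the gain at the small block sizes. Raising the handshake cost, as would be expected for a software module, makes the result more sensitive to block size.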

The next experiment we performed was to evaluate the effect 
of memory conflicts due to the shared memory bus on the 
performance of the individual processes as well as the overall system 
performance. The performance estimates without and with 
memory conflicts are presented in Tables 1 and 2, for a sequence of 
three packets (512, 64 and 448 bytes long) that are part of a longer 
stream. The performance estimates without memory conflicts were 
obtained by not including the arbiter process, and modeling 
memory as an array shared between the create_pack, ip_check 
and checksum processes. Access to the shared array is 
performed using a C function call annotated with a fixed delay to 
represent the access time of the memory. 



Table 2. Processing times with memory conflicts 

packet #   create_pack   ip_check   checksum   total 
1          513           1617       1538       3688 
2          65            218        192        475 
3          449           1418       1346       3213 



The results indicate that: 



• The performance of the create_pack process was 
not affected by the presence of memory conflicts. This 
is because the memory arbiter gives highest priority to 
requests from create_pack when simultaneous or 
pending requests are present. 

• The per-packet performance estimates of the 
ip_check and checksum processes are in error 
(underestimates) by 46.9% and 41.4%, respectively, if 
memory conflicts are ignored, and the total performance 
of the system is underestimated by 36.39%. 



It is clear from the above results that the effects of memory 
conflicts due to the use of shared memory and the DMA 
block size need to be considered while estimating the per- 
formance of HW/SW systems. 
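
The per-process error figures quoted above can be recomputed from the ip_check and checksum columns of Tables 1 and 2. The short Python sketch below assumes that the error is measured relative to the conflict-free estimate and summed over the three packets; this interpretation is an assumption, chosen because it reproduces the 46.9% and 41.4% figures. The names used are hypothetical.

```python
# Values copied from Table 1 (without conflicts) and Table 2 (with conflicts).
without_conflicts = {"ip_check": [1101, 149, 965], "checksum": [1088, 136, 952]}
with_conflicts = {"ip_check": [1617, 218, 1418], "checksum": [1538, 192, 1346]}

def underestimate_pct(no_conflict, conflict):
    """Extra time caused by conflicts, as a percentage of the conflict-free estimate."""
    base, actual = sum(no_conflict), sum(conflict)
    return 100.0 * (actual - base) / base

errors = {name: round(underestimate_pct(without_conflicts[name], with_conflicts[name]), 1)
          for name in without_conflicts}
```

With the table values as given, errors evaluates to 46.9 for ip_check and 41.4 for checksum.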

4 Conclusions and Future Work 

We presented a case study of the effects of shared 
memory buses and arbiters during system-level performance 
analysis. Using part of a TCP/IP network 
interface system as an example, we have proposed a methodology to 
model the shared memory bus and arbiter at the behavioral 
level. We presented experimental results to demonstrate that 
ignoring these effects leads to a large error in system-level 
performance estimates, and that the effects of some design 
tradeoffs cannot be evaluated without modeling memory ef- 
fects accurately. We are currently working on automatically 
generating the models required to incorporate the effects of 
the shared memory bus and memory arbiter during perfor- 
mance analysis of HW/SW systems. 
Acknowledgements: The authors would like to thank 
Leslie French and Toshio Misawa of NEC C&C Research 
Labs for providing the software implementation of the 
TCP/IP system, and for useful technical discussions. 

1. ABSTRACT 

The Codesign Finite State Machine [1] (CFSM) formal 
model provides a suitable approach for the description of 
hardware/software systems. The POLIS tool from Berkeley 
implements the CFSM methodology but currently relies on 
the textually based Esterel specification language as a high 
level for the description of individual CFSMs. The designer 
must then use the Ptolemy simulator to interconnect the 
CFSM network and perform co-simulation. This paper 
describes work in progress in developing a system which 
instead aims to use Statemate™, a statechart-based tool, for 
seamless specification and co-simulation of the entire CFSM 
network, whilst using the POLIS tool for 'C' and VHDL code 
generation and performance estimation. This technique 
should give the clear advantages of using a graphical 
specification language together with a uniform co-simulation 
framework. 

1.1 Keywords 

Statecharts, POLIS, CFSMs 

2. INTRODUCTION 

Statecharts[2] are a proven and industrially desirable 
specification methodology. They should also be suitable for 
use as a specification language for a hardware/software 
codesign system. CFSMs use a locally synchronous, globally 
asynchronous paradigm that has advantages for partitioning 
a system. Our work has investigated how individual 
Statechart hierarchies can provide a synchronous model 
which may represent individual CFSMs in a design. 
Statechart-represented CFSMs may then be connected in a 
network to form a suitable system model. The 
Statemate tool may then be used to represent the globally 
asynchronous paradigm allowing the CFSM network to be 
simulated. 

Several examples of the use of statecharts for automatic 
hardware and software generation exist. The work of 
Drusinsky and Harel [3] developed techniques for mapping 
Statechart trees onto programmable devices, and has been 
developed and used by Buchenrieder [4] since. Several 
commercially available systems (Statemate and SpecChart 
code generators) exist, but from a hardware generation 
aspect require the use of behavioural VHDL as an 
intermediate step. In his paper [4] Buchenrieder states: 'The 
Statemate™ code generator produces behavioural VHDL 
which is often not synthesisable'. The author has verified 
this whilst attempting to synthesise seemingly trivial 
Statechart models to hardware via Statemate generated 
VHDL, using the Cadence DFWII™ software suite. 

Permission to make digital or hard copies of all or part of this work for 
personal or classroom use is granted without fee provided that copies 
are not made or distributed for profit or commercial advantage and that 
copies bear this notice and the full citation on the first page. To copy 
otherwise, to republish, to post on servers or to redistribute to lists, 
requires prior specific permission and/or a fee. 
CODES '99 Rome Italy 
Copyright ACM 1999 1-58113-132-1/99/05...$5.00 

From a software code generation aspect, to the author's 
knowledge, none of the existing Statechart based methods 
and tools allow for code performance estimation on a target 
processor. This is a major drawback if the tools are to be 
used as the basis for a codesign system. Similarly, from a 
hardware performance estimation point of view (with the 
exception of Buchenrieder's approach), hardware 
performance can only be estimated via VHDL. 

The Codesign Finite State Machine approach to hardware/ 
software codesign has several benefits. The model is globally 
asynchronous and this greatly simplifies partitioning of the 
model into hardware and software modules. Furthermore due 
to the relatively low level structure of individual CFSMs 
synthesis of hardware or software is relatively 
straightforward. The CFSM software synthesis uses the S- 
Graph as an intermediate step to estimate the performance on 
a target processor. This process has the advantage of 
producing portable and efficient code. 

However since the CFSM model is at too low a level to be 
used directly by designers, higher level specification 
languages are necessary. Suitable languages are the 
synchronous class of languages which include Esterel[5], 
statecharts or a subset of VHDL. 

The POLIS [6] system developed at the University of 
California, Berkeley, implements a HW/SW codesign 
system using the CFSM model as its basis. As the input to 
the system the intermediate level SHIFT language is used. 
SHIFT is capable of describing a hierarchical network of 
Codesign Finite State Machines. It allows for the description 
of individual CFSMs as reactive finite state machines. 
Individual CFSMs may then be embedded in a net-list which 
may reference other CFSMs or arithmetic, boolean or user- 
defined functions. 

The chosen specification language for the POLIS project is 
Esterel [5]. To complement POLIS, an strl2shift converter 
program has been written, and in this way CFSMs may be 
developed in Esterel, converted into SHIFT and input to the 
POLIS program for synthesis. Whilst Esterel is a sound 
synchronous language with a good debugger, it does not 
offer the design advantages of a good graphical language 
such as statecharts. There is also a further drawback: 
having first entered an Esterel description of each CFSM, the 
designer must next use the separate Ptolemy tool to 
interconnect CFSMs and perform co-simulation. 

3. USING STATECHARTS TO MODEL 
COMMUNICATING CFSM NETWORKS 

Using our methodology each individual CFSM in a network 
of interconnected CFSMs is modelled by a separate 
statechart hierarchy. The interconnections between 
statecharts (in the form of propagated events and signals) 
may be specified using the Statemate statechart package's 
Activity Charts as illustrated in Figure 1. Flows between 
statecharts are used to represent events and signals. In this 
example of a paint drying system, the Statecharts 
PAINT_SYS_CTRL and BLOWER_SYS are referenced as 
providing the behaviour for each CFSM. 



Figure 1. Statemate Activity Chart (chart labels: PAINT_MAIN, PAINT_SYSTEM, @PAINT_SYS_CTRL, BLOWER_SYSTEM, @BLOWER_SYS) 

3.1 Modelling Asynchronous CFSMs in Statemate 

Statemate uses Activity Charts to allow the modelling of 
complex systems composed of many activities, each of 
which contain a behavioural specification in the form of a 
Statechart. The interaction of multiple Statechart hierarchies 
is controllable via start, stop, suspend and resume 
commands which may be issued by controlling Statecharts. 
The CFSM network must eventually be implemented in a 
combination of software and hardware. In the case of 
software a scheduling policy must be implemented on the 
target system. We have therefore modelled a basic scheduler 
in Statemate, allowing control of the interaction of events 
between various CFSMs. 

The scheduler consists of a Statechart (Figure 2) which 
implements the functionality of a scheduler and controls the 
buffering of system inputs and outputs. Every simulated 
clock cycle, the scheduler triggers a Statemate Activity 
which detects any system input changes and places them in 
an input buffer modelled in Statemate's language using an 
array. This activity is continuously executed THROUGHOUT 
the system being in state IDLE. After each execution 
of this activity the Statechart monitors the input arrays for 
changes in the input signals corresponding to each CFSM. If any 
changes are detected then the scheduler algorithm is run. 
This scheduler algorithm is specified by the statechart 
DO_SCHEDULE. 



Figure 2. Statechart based scheduler (chart labels: SCHED_CTRL, ACTIVITY_NO := 1, [any(IN_LIST1) or any(IN_LIST2)], DO_SCHEDULE) 

The simple algorithm presented here in Figure 3 allows for 
scheduling of just two activities but could be expanded for 
more or possibly 'n' activities. 



Figure 3. Statechart for scheduler algorithm (chart labels: DO_SCHEDULE, SET_CURRENT_AC, GEN_IPS1, GEN_IPS2, DO_ACTIVITY1, DO_ACTIVITY2; guards test any(IN_LIST1), any(IN_LIST2) and ACTIVITY_NO) 

A static reaction (code triggered when entering a state) 
within state SET_CURRENT_AC chooses the next activity 
to be scheduled in a 'round robin' fashion. Two transitions 
then use the Statemate rs!(A) 'resume activity' command to 
execute the given task if it currently has pending input 
event(s). 
The GEN_IPS state triggers a Statemate activity which 
propagates pending input events to the relevant activity 
being scheduled. The system remains in this state until all 
events/signals have been propagated. The system then enters 
the DO_ACTIVITYn state wherein ACTIVITYn is 
executed for a simulation clock cycle. The Statemate toolset 
supports two 'time models', and in this case, if its 
Asynchronous time model option is selected, one clock 
cycle corresponds to as many simulation 'micro-steps' as are 
needed to reach a new stable system state. So resuming 
execution for one cycle results in a single transition in the 
selected activity's statechart. Whilst the chosen activity 
executes, another activity, DETECTOPS, is enabled to 
detect output changes in the selected statechart and store 
them in an output signal buffer. Once the chosen activity has 
executed for a simulation cycle the activity is suspended. 
The activity PROP_SIGS is next executed, which effectively 
propagates any pending active outputs in the output buffer 
back to the relevant input buffer (so that signals may be 
exchanged between the scheduled tasks). If any active 
pending signals are then found in the input buffer, the 
DO_SCHEDULE state is re-entered and the scheduling 
algorithm is repeated; otherwise the state is exited and the 
system enters the IDLE state, detecting any further system 
inputs. 
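
The scheduling cycle described above can be paraphrased as a small Python sketch. It is a simplified analogy under stated assumptions and not the Statemate model itself: the function run_scheduler, the task callables and the max_steps guard are hypothetical, input buffers are modelled as deques, and each task is run once on all of its pending events instead of for one simulated clock cycle.

```python
from collections import deque

def run_scheduler(tasks, inputs, max_steps=100):
    """Round-robin scheduling of tasks that exchange events through input buffers.

    tasks:  dict name -> function(list_of_events) -> list of (target_name, event)
    inputs: dict name -> deque of pending input events for that task
    Returns the list of (task_name, consumed_events) in execution order.
    """
    order = list(tasks)
    current, steps, trace = 0, 0, []
    while any(inputs[name] for name in order) and steps < max_steps:
        name = order[current]
        current = (current + 1) % len(order)     # choose the next task in round-robin order
        if not inputs[name]:
            continue                             # no pending input: do not resume this task
        events = list(inputs[name])              # propagate pending inputs to the task
        inputs[name].clear()
        outputs = tasks[name](events)            # resume the task, then suspend it again
        trace.append((name, events))
        for target, event in outputs:            # propagate outputs back to input buffers
            inputs[target].append(event)
        steps += 1
    return trace                                 # all buffers empty: back to the idle state
```

The max_steps guard is a design choice of this sketch only; it stops two tasks that keep sending events to each other from looping forever.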

3.2 Embedding performance data in the statechart model 

The above method allows statecharts to be used to simulate
a CFSM network. Using tools we have developed, it is
currently possible to generate a CFSM description in
the Berkeley SHIFT format from the Statechart description
of the CFSM network, and to use the POLIS system to
perform performance estimation and synthesize software
and hardware. In future work we aim to embed performance
data back into the Statechart model for use in co-simulation.

4. GENERATION OF A SHIFT SPECIFICATION
FROM A STATECHART MODEL

In order to use our methodology of representing a CFSM 
model using Statecharts in a codesign system, we interfaced 
it to the POLIS package to facilitate performance estimation 
and hardware/software synthesis. Since the POLIS system
supports Esterel, we initially considered generating Esterel
code from our statechart model.

4.1 Generating Esterel from a statechart 
model 

Representing a statechart design using Esterel is not as
straightforward as might be thought. This is mainly because
a statechart transition may be triggered by an event
occurrence, by the value of a condition variable, or by a
combination of the two. In effect, any state in a Statechart
having transitions labelled with more than just basic events
must be represented in an Esterel program using a loop which
continuously checks the status of the condition variable.
Furthermore, in order to devise a methodology suitable for
automated code generation it was necessary to explicitly
encode statechart state variables in the Esterel code. This
is because statecharts allow transitions directly into
sub-states of a state and vice versa, and this is difficult
to represent naturally in an equivalent Esterel program.

Taking these considerations into account, we devised a
suitable methodology for the generation of Esterel from
statecharts. Unfortunately, when this was applied to simple
examples, although the Esterel code could accurately
represent a statechart model, the resulting SHIFT description
(obtained using the POLIS strl2shift utility) was
unreasonably large. This results in poor performance when
hardware or software is synthesized. For this reason using
Esterel code as an intermediate step has been discounted.

4.2 Direct generation of SHIFT description 
from a statechart model 

The statechart language provides a large number of different 
functions and operators that may be used for transition 
expressions. It is our aim to initially provide support for a 
reasonably large subset of these in our statechart to SHIFT 
converter. The system we have developed initially does not 
handle AND states although we have some ideas of how 
these could be added in the future. This is not as much a 
drawback as might be thought however since similar 
concurrent behaviour can be obtained using multiple 
Statecharts in an asynchronous manner. 

Our converter can handle most common statechart EVENT/ 
CONDITION expressions, including comparisons involving 
data items. 

4.2.1 Handling Hierarchy

When generating the SHIFT representation we flatten each 
statechart into an FSM representable in the SHIFT 
language. 

Hierarchical transitions are probably the major advantage of 
the statechart language. Our system considers any transition 
from a non-basic state as implicitly representing a transition 
from each of the source state's child states. The target state 
for all transitions must also be represented in the FSM as a 
basic state so if the transition arrives at a non-basic state that 
state's default transition is followed. The process of 
recursive descent is followed until a basic state is found. 
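
A minimal Python sketch of this flattening step is given below. It is an illustration under stated assumptions, not the converter itself: the function names, the dictionaries children and default, and the example states are hypothetical, and the sketch expands a non-basic source state all the way down to its basic descendants, which is one reading of "each of the source state's child states".

```python
def resolve_target(state, children, default):
    """Follow default transitions downwards until a basic state is reached."""
    while children.get(state):
        state = default[state]
    return state


def basic_descendants(state, children):
    """Return the basic (leaf) states contained in `state`, or [state] if basic."""
    kids = children.get(state, [])
    if not kids:
        return [state]
    result = []
    for kid in kids:
        result.extend(basic_descendants(kid, children))
    return result


def flatten(transitions, children, default):
    """Rewrite hierarchical transitions so sources and targets are basic states."""
    flat = []
    for source, trigger, target in transitions:
        resolved = resolve_target(target, children, default)
        for basic in basic_descendants(source, children):
            flat.append((basic, trigger, resolved))
    return flat


# Hypothetical chart: ON is a non-basic state with children HEATING and PURGING.
children = {"ON": ["HEATING", "PURGING"]}
default = {"ON": "HEATING"}
transitions = [("OFF", "start", "ON"), ("ON", "stop", "OFF")]
flat = flatten(transitions, children, default)
for row in flat:
    print(row)
```

In this example the transition into ON is redirected to its default child HEATING, and the transition out of ON is duplicated for each of its basic states.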

4.2.2 Handling transition expressions 
Statecharts generally include relatively complicated 
transition expressions. The use of pure valued or conditional 
expressions without any event information is also quite 
common. The CFSM methodology on the other hand 
assumes that all pure valued signals carry an associated 
event presence signal. This normally gives an advantage as 
the CFSM can wait for an event related to a change of 
valued signal or data item before reacting. 

There are however instances where the CFSM must 'self
trigger' to be able to correctly handle pure conditional
expressions. The Statechart in Figure 4 illustrates such an 
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example. 











Figure 4. Self Triggering Example



When this Statechart is implemented as a CFSM, the CFSM
must emit a self trigger event on the transition from HEATER
to PURGE so that it subsequently re-triggers itself. This is
necessary because the HEAT value may already be false, in
which case the CFSM must transition immediately back to the
HEATER state.

Two commonly used functions in Statecharts are tr(C)
and fs(C), which sense when a condition becomes true or
false. In order to provide such functionality in the CFSM it
is necessary to provide double buffering of the condition
variables which are used in these functions. The CFSM thus
generates the 'previous value' of each such input on each
transition. These outputs may then be fed back to the CFSM
and used at the next time-step. Additional transitions must
also be provided so that if the CFSM is idling in a given
state then these inputs are still double buffered (even if there
is no resulting transition to another state).
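
The double-buffering idea can be illustrated with a short Python sketch. The class name ConditionBuffer and its methods are assumptions made for this illustration; in the CFSM the 'previous value' is an output fed back as an input, whereas here it is simply held in an attribute.

```python
class ConditionBuffer:
    """Keeps the current and previous value of a condition C."""

    def __init__(self, initial=False):
        self.previous = initial
        self.current = initial

    def step(self, value):
        """Advance one time-step; must be called even when the machine idles."""
        self.previous = self.current
        self.current = value

    def tr(self):
        """True when C became true during the last step."""
        return self.current and not self.previous

    def fs(self):
        """True when C became false during the last step."""
        return self.previous and not self.current


heat = ConditionBuffer()
heat.step(True)
rose = heat.tr()            # condition has just become true
heat.step(True)
still_rising = heat.tr()    # no change, so tr() is false again
heat.step(False)
fell = heat.fs()            # condition has just become false
print(rose, still_rising, fell)
```

Calling step on every time-step, including those with no state change, mirrors the additional idling transitions mentioned above.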

To greatly simplify the evaluation of Statechart expressions
we have used the following rules when generating the
CFSM:

• Atomic transitions (those whose expressions are a single 
event or condition) are handled by feeding the relevant 
input directly to the relevant CFSM input. 

• All remaining transitions are assigned their own separate 
pure boolean CFSM input and the transition expression 
is evaluated using a series of combinational functions in 
the CFSM net, and fed to this input. 

4.2.3 Resolving transition priority
The CFSM must correctly resolve the priority of statechart 
transitions. This is simply achieved because each transition 
has one and only one input that is used as a trigger. Hence it 
is easy to resolve transition priorities such that low level 
transitions can only occur when higher level transition 
trigger functions are not currently active. 

Where two transitions of the same priority could be enabled
at the same time (a non-deterministic choice), our converter
currently ensures deterministic behaviour in the resulting
CFSM by making the two transitions mutually exclusive.
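
The priority rule can be sketched in Python as below. This is only an illustration: the function name, the numeric level encoding (smaller number meaning a higher-level transition) and the use of declaration order to separate equal-priority transitions are assumptions, since the text does not state how the mutual exclusion is chosen.

```python
def enabled_transition(transitions, active):
    """Pick the single transition allowed to fire.

    transitions: list of (name, level) in declaration order; a smaller level
                 means a transition attached higher in the state hierarchy.
    active:      set of transition names whose trigger input is currently true.
    Returns the chosen transition name, or None if no trigger is active.
    """
    candidates = [
        (level, index, name)
        for index, (name, level) in enumerate(transitions)
        if name in active
    ]
    if not candidates:
        return None
    # Higher-level triggers mask lower-level ones; ties fall back to
    # declaration order (an assumption of this sketch).
    return min(candidates)[2]


transitions = [("inner_a", 2), ("inner_b", 2), ("outer", 1)]
print(enabled_transition(transitions, {"inner_a", "outer"}))
print(enabled_transition(transitions, {"inner_a", "inner_b"}))
```

Because every transition has exactly one trigger input, the selection needs only the set of active triggers and the static priority of each transition.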

5. USING THE SYSTEM 

To date we have completed the described Statechart-CFSM
converter and tested it with simple examples. Synthesis of
software is satisfactory and we are able to run the POLIS
generated code on a UNIX™ workstation.

The size of the resulting SW/HW (from the POLIS
software) via our system compares very favourably with
that from Esterel code and the appropriate converter. This is
shown by a comparison (see Table 1) in which we hand-coded
an equivalent description of a statechart in Esterel.
Generally our system achieves more efficient results than
the Esterel specification when specifications contain many
non-event expressions.



Table 1: Comparison of HW/SW synthesis

Specification  POLIS SW synth.   SW costs for 68hc11  HW costs (3)
method         time (SPARC 20)   (see key)            (see key)
-------------  ----------------  -------------------  --------------------------
Esterel        800 sec           size: 3911           POLIS failed to synthesize
                                 min t: 338           within 4 hrs
                                 max t: 1128

Statechart     8 sec             size: 843            pi: 41
                                 min t: 223           po: 6
                                 max t: 511           lat: 10
                                                      sop: 353
                                                      fac: 235

Key for synthesis costs:
  size    = s/w size in bytes
  min t   = min cycle time in seconds
  max t   = max cycle time in seconds
  pi/po   = no. primary inputs/outputs
  lat     = number of latches
  sop/fac = no. literals in sum-of-product or factored form

(3) Technology independent form.

Synthesis of hardware using the POLIS system currently 
raises some problems. Due to the relatively high
arithmetic complexity of such CFSMs we are
experiencing unreasonably long hardware synthesis times.
The reason for this appears to be that the POLIS system is 
attempting to flatten all parts of the resulting network and 
then apply boolean minimisation. 

One possible solution to this problem may be to use POLIS 
generated behavioural VHDL together with commercial 
synthesis tools. The improved generation of VHDL from the 
CFSM network is a feature of the new release of the POLIS 
software. 

6. CONCLUSIONS 

Statecharts are a convenient specification methodology for 
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use in conjunction with CFSM theory for the description of 
individual CFSMs. As they are graphical they are easier for
the designer to visualise, and tools such as Statemate
offer powerful simulation capabilities.

Our methodology promises several advantages over the 
standard use of POLIS with the Esterel language for 
specification and Ptolemy for co-simulation as a codesign 
system: 

• For the statechart models that we have considered, the
resulting CFSMs are smaller than those obtained via the
standard route of Esterel with the POLIS system. In practice
this means greater efficiency in S/W and H/W
implementations. This is demonstrated by the metrics in
Table 1.

• Statecharts are in our view a more industrially desirable 
specification methodology than Esterel, being graphical 
in nature. 

• The SHIFT file generated from a statechart is close to 
the original statechart specification. Therefore it may be 
possible for POLIS generated performance data to be 
embedded in the original Statemate model in order to 
provide high-level co-simulation in a uniform environ- 
ment. 

• If the specification and simulation tasks can be
integrated, a uniform codesign system will be obtained.

It is quite early in our project and at present we only give an 
outline of the system we aim to produce. To date software to 
generate the SHIFT code from a Statechart model has been 
developed. Future work will further develop the use of the 
Statemate environment for high level co-simulation 
including performance estimation. We also intend to expand
our Statemate based scheduler to allow the simulation of
hardware as well as software CFSMs.

Our codesign system will then be proved with a suitable 
real-world industrial case study. 
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Abstract 

Timing analysis for checking satisfaction of constraints is a 
crucial problem in real-time system design. In some cur- 
rent approaches, the delay of software modules is precalcu- 
lated by a software performance estimation method, which 
is not accurate enough for hard real-time systems and
complicated designs. In this paper, we present an approach to
integrate a clock-cycle-accurate instruction set simulator
(ISS) with a fast event-based system simulator. By using the
ISS, the delay of events can be measured instead of
estimated. An interprocess communication architecture and a
simple protocol are designed to meet the requirement of 
robustness and flexibility. A cached refinement scheme is 
presented to improve the performance at the expense of 
accuracy. The scheme is especially effective for applica- 
tions in which the delay of basic blocks is approximately 
data-independent. We also discuss the implementation 
issues by using the Ptolemy simulation environment and 
the ST20 simulator as an example. 

1. Introduction 

Timing is one of the most important issues in real-time 
embedded system designs. The correctness of designs 
largely depends on the correctness of the interaction of 
hardware and software modules. Hardware/software 
cosimulation provides an integrated way to simulate this 
interaction, because it not only simulates the functionality 
but also simulates the delay of each module and the timing 
relations among the events. So it is essential for the simula- 
tor to use timing information for each module, especially 
software modules, which is as close to reality as possible. 
An incorrect delay may cause the simulation result to differ 
from the behavior of the implemented system. 

Polis [1] is a hardware/software codesign environment 
for control dominated embedded systems. Polis is based on 
a formal model of computation called codesign finite state 
machine (CFSM). In Polis, systems are modeled as a group
of communicating CFSMs, each of which is originally
described in a formal language, e.g. Esterel. In the
simulation phase, both hardware and software modules are
simulated in the Ptolemy environment.

Ptolemy is a complete design environment for simulation
and synthesis of mixed hardware-software embedded
systems. In Ptolemy jargon, each functional block (a
software or a hardware module) is called a star. Each star 
has one or more input and output ports. Stars talk to each 
other through links between ports that carry discrete events 
as FIFO queues. 

A typical design flow in Polis is as follows (of course 
there will be feedback among different stages): 

1. Create the system specification using synchronous- 
reactive system specification tools, such as Esterel. 

2. Compile the source code and generate CFSM models. 

3. Build simulation modules (stars) in the Discrete Event 
domain of Ptolemy. 

4. Select target resources (microcontroller and real-time 
scheduler), assign each star as either hardware or soft- 
ware. 

5. Run the simulation in Ptolemy, paying special atten- 
tion to deadline violation and timing consistency. 

6. If the simulation result is satisfactory, synthesize the
system into software and/or hardware.

7. Repeat more detailed simulation. 

The software performance issue (clock cycles needed 
for a software module to execute) is involved in step 5 
above. 

Software performance has to be estimated in hardware/software
codesign tools [5][10]. The advantages of estimation are small
simulation overhead, ease of integration, and flexibility of
porting to multiple microprocessors. The disadvantage is
the poor accuracy. For hard real-time embedded systems, it
is crucial to provide accurate timing information during the
simulation phase.

Instruction Set Simulators (ISS) are software environ- 
ments which can read microprocessor instructions and sim- 
ulate their execution. Most of these tools can provide 
simulation results like values in memory and registers, as 
well as timing information (e.g. clock cycle statistics). 
Thus, ISS's provide a way to refine the timing calculation 
during the simulation phase. 

We begin in section 2 with the analysis of existing software
timing estimation methods and their restrictions. In
section 3, an interprocess communication architecture and
a simple protocol are designed for integrating Ptolemy with
ISS's. In order to improve performance, a cached timing
refinement scheme is presented in section 4. An
implementation example is given in section 5.

2. Related Approaches 

Although the idea of performing a hardware/software 
cosimulation by combining RTL hardware simulation with 
cycle accurate instruction set simulators (ISS) is not new, a 
seamless integration of the two environments with high 
accuracy and good performance is still an unsolved prob- 
lem. 

In [9] a method which loosely links a hardware simula- 
tor with a software process is proposed. Synchronization is 
achieved by using the standard interprocess communica- 
tion (IPC) mechanisms offered by the host operating sys- 
tem. One of the problems with this approach is that the 
relative clocks of software and hardware simulation are not 
synchronized. This requires the use of handshaking proto- 
cols, which may impose an undue burden on the imple- 
mentation. This may happen, for example, because 
hardware and software would not need such handshaking 
since the hardware part runs in reality much faster than in 
simulation. 

In [5] a method which keeps track of time in software
and hardware independently and synchronizes them
periodically is described. In [7] an optimistic and non-IPC
approach for improving the performance of single-processor
timed cosimulation is presented.

The biggest problem with all the current strategies is 
how to optimize the communication overhead required to 
synchronize the execution of the ISS and hardware simula- 
tor. Simply synchronizing the hardware and the software 
simulator in a lock-step fashion at every clock cycle would 
result in very limited performance due to the very low sim- 
ulation speed obtainable. The other typical drawback of 
this approach is that the system has to be recompiled and 
resynthesized when the partition is changed. 

The commercial tool Mentor-Seamless CVE can
offer a certain number of optimization techniques such as
an instruction-fetch optimization that can directly fetch
operations to the memory-image server, thus eliminating
these cycles from the logic simulator workload. With a
memory image server for memory read/write optimization,
it can process operations 10,000 times faster than a logic
simulator. This optimization is controlled by the user: early
in the co-verification process, all memory operations
should be directed to the logic simulator to debug all the
operations of the memory sub-system. As the co-verification
progresses, larger amounts of memory access can be
optimized gradually, further speeding up the software
execution.

The link with an ISS is not only applicable to hardware/software
cosimulation, but is also useful for the general problem
of embedded program timing analysis. For example, in
[4] an approach which combines simulation with formal 
techniques is presented. They address the timing analysis 
problem by trying to approach each step in the analysis 
with the best methods currently known. They perform an 
architecture classification and a successive analysis and 
combine the classical approach of Instruction Timing 
Addition (ITA), in which the addition of the execution 
times in a basic block or in a path segment is computed, 
with a Path Segment Simulation (PSS), in which a cycle 
true processor model is used on a specific program path. 
Basically, they use the ITA approach for addressing the 
problem of data dependent instruction execution timing 
typical of some microcoded CISC architectures (e.g. multi- 
plication) and PSS for all the other problems related to the 
impact of the architecture (pipelining, caching, supersca- 
larity). The ISS is run only once on all basic blocks in the 
program and then the information collected is used with a 
formal analysis to determine the worst case execution time 
of the program. 

Our approach is close to [4], but we determine which 
basic blocks are used at run-time by using a delay caching 
technique. Our approach is to build on an existing
approach to time-approximate cosimulation, based on
source-code estimation of execution time, and refine its
precision by using an ISS. In particular, we do not require
the designer to change the system specification. Hence we 
preserve the partitioning approach based on rapid interac- 
tion with the simulator, without the need to recompile or 
modify the specification (other than change implementa- 
tion attributes for modules). The performance remains 
acceptable, with only a slight decrease in accuracy, thanks 
to a totally automated caching approach. 

We also standardize the interface between the ISS and 
the system simulator, thus making the porting to a new ISS 
very easy. 

3. Architecture and Protocol Design 

The integration of Ptolemy and ISS should have the
following properties:

• Support for different ISS's

• A uniform interface

• Minimal code generation

To achieve these requirements, an interprocess commu- 
nication architecture is provided. More precisely, a wrap- 
per program is designed to glue Ptolemy and the ISS 
together. 

The IPC architecture is shown in Figure 1. The wrapper
program acts both as an interface unifier and a command
translator. For each kind of ISS, a wrapper is built
specifically so that it accepts the predefined ISS commands from
Ptolemy, translates them into ISS syntax, and vice versa.

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Figure 1. Interprocess Communication Architecture

The syntax is defined to be concise enough so that the 
communication on each pipe is minimized. In other words, 
the commands adapt the CFSM simulation model to the 
ISS. The set of commands is listed in Table 1.

Table 1: Ptolemy-ISS Command Stack

Name       Syntax                 Description
---------  ---------------------  ------------------------------------
start      s <module> <?output?>  Initialize ISS, load binary code,
                                  set breakpoints at the beginning of
                                  the module and the corresponding
                                  output event emission point. If
                                  <output> is not set, the default
                                  end point is the end of the module.

write      w <variable> <value>   Set a variable; variables could be
                                  CFSM states, event flags, star
                                  parameters, input values.

run        r                      Simulate up to a breakpoint.

statistic  z                      Get the clock cycle statistic.

quit       q                      Terminate the simulator.
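
As an illustration of how compact the command stream is, the following Python sketch builds the text that a star might send for one refinement request. Only the command letters and argument order come from Table 1; the function names, the one-command-per-line framing and the example module and variable names are assumptions made for this sketch.

```python
COMMAND_LETTERS = {
    "start": "s",
    "write": "w",
    "run": "r",
    "statistic": "z",
    "quit": "q",
}


def encode(name, *args):
    """Encode one command as a single line of text (framing is assumed)."""
    return " ".join([COMMAND_LETTERS[name]] + [str(arg) for arg in args])


def refinement_session(module, variables, output=None):
    """Commands for one timing refinement: start, writes, run, statistic."""
    start_args = (module,) if output is None else (module, output)
    lines = [encode("start", *start_args)]
    lines += [encode("write", var, val) for var, val in variables.items()]
    lines += [encode("run"), encode("statistic")]
    return lines


# Hypothetical request for a module named COMPARE with inputs A and B.
session = refinement_session("COMPARE", {"A": 3, "B": 5}, output="O1")
print("\n".join(session))
```

A wrapper for a particular ISS would translate each of these lines into that simulator's own command syntax.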



The system works as follows: 

□ In the setup phase, each star that uses an external
ISS first checks whether the wrapper has been started. If not, it
forks a subprocess and executes the wrapper.

□ Each time the star gets fired in Ptolemy, it detects
whether the delay needs to be refined (see section 5). If so,
it sends the module information together with the variables
to the wrapper.

□ After the wrapper has received all the information
needed to run the simulator, it translates it into the ISS
syntax, and sends it to the ISS.

□ The ISS loads the executable code of the corresponding
star, sets the variable values and breakpoints, and
executes the code up to the required breakpoint.

□ The wrapper waits until the ISS finishes the requested
execution, and gets the clock cycle counting information.

□ The wrapper sends the clock cycle counting result to
the Ptolemy star in a pre-defined format.

When this value is returned from the wrapper, the star
adds it to the timestamp of the input event to obtain the
timestamp of the output event. One communication
session is then considered finished.

4. Cached Refinement Scheme 

It is easy to imagine that, for a long run, if an ISS is 
used for every firing of a software star, the simulation could 
be quite slow. In order to speed up the timing refinement 
process and to keep the advantage of accurate clock cycle 
counting, we designed a cached refinement scheme, based 
on the properties of the s-graph and the discrete event 
semantics. 

4.1. S-graph 

An s-graph is a model of software execution that is used 
by the Polis cosimulation. It is a directed acyclic graph
(DAG) with one source node called BEGIN and one sink 
node END. In s-graphs, there are two more types of nodes, 
ASSIGN and TEST. ASSIGN and BEGIN nodes have only 
one successor, and TEST nodes have two or more. An 
expression is associated with each TEST node, and accord- 
ing to the value of the expression (boolean or integer) the 
TEST node selects one of its children. An execution of an 
s-graph starts from the BEGIN node, traverses several 
nodes, until it reaches the END node. The sequence of 
nodes and edges during one execution forms a path of the 
execution. From the structure of the s-graph, it is easy to 
conclude: 

Property 4.1: An s-graph can only have a finite number
of paths, and for each execution the path is completely 
determined by the expression values at TEST nodes. 

A node is called an EMIT node if its function is to emit 
an event to an output port of the CFSM. The time delay of 
emitting an output event is the time cost of the execution 
from the BEGIN node to the corresponding EMIT node. If 
we treat the END node as a special kind of event emission, 
the expressions at TEST nodes together with the final point 
of the path (EMIT or END node) uniquely specify a path to 
generate an event. 
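
The following Python sketch illustrates Property 4.1 on a small s-graph. The dictionary encoding of nodes, the function name run_sgraph and the example graph (modelled loosely on a two-way comparison) are assumptions made for illustration and do not reflect the Polis data structures.

```python
def run_sgraph(nodes, env, stop_at="END"):
    """Execute an s-graph from BEGIN and record the TEST outcomes.

    nodes:   dict mapping node id -> node description (see example below)
    env:     dict of variable values, updated by ASSIGN nodes
    stop_at: name of the output event whose emission ends the path,
             or "END" to run to the sink node
    Returns (tuple of (test node id, value), termination point).
    """
    trace = []
    current = "BEGIN"
    while True:
        node = nodes[current]
        kind = node["kind"]
        if kind == "END" or (kind == "EMIT" and node["event"] == stop_at):
            return tuple(trace), (node["event"] if kind == "EMIT" else "END")
        if kind == "TEST":
            value = node["expr"](env)
            trace.append((current, value))       # the path is fixed by these values
            current = node["children"][value]
        else:
            if kind == "ASSIGN":
                env[node["var"]] = node["expr"](env)
            current = node["next"]               # BEGIN, ASSIGN, EMIT: one successor


# Hypothetical s-graph: emit O1 if A > B, otherwise emit O2.
COMPARE = {
    "BEGIN": {"kind": "BEGIN", "next": "t1"},
    "t1": {"kind": "TEST", "expr": lambda e: e["A"] > e["B"],
           "children": {True: "e1", False: "e2"}},
    "e1": {"kind": "EMIT", "event": "O1", "next": "END"},
    "e2": {"kind": "EMIT", "event": "O2", "next": "END"},
    "END": {"kind": "END"},
}

print(run_sgraph(COMPARE, {"A": 5, "B": 3}, stop_at="O1"))
print(run_sgraph(COMPARE, {"A": 1, "B": 3}, stop_at="O2"))
```

The returned pair of recorded TEST values and termination point is exactly the information that identifies a path to an event.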

Generally speaking, the time delay of an event is not
equal to the sum of the delays of each node along the path.
For example, compiler optimizations, caching and
pipelining can greatly affect the delays. In particular, deep
pipelining can overlap the execution of several basic blocks
in the program. In addition, program and instruction
fetches make the execution time of a path depend on the
sequence of instruction/data fetches. We can
thus have a great variance in the delay information for
different executions of the path under different input stimuli
and internal conditions. However, when we look at an
entire path, the delay often varies in a relatively small
range. In other words, when considering a medium-large
level of granularity, most software modules have execution
delays that are approximately independent of past
executions and depend mostly on input data. For this sort of
application, the delay of an execution path can be stored
and used the next time the same path is traversed.
Although this approximation reduces the accuracy, the
simulation execution time can be dramatically improved.

4.2. The Ptolemy Discrete Event Domain 

The semantics of the Ptolemy DE domain is that all
signals (events) have two fields, a tag and a value. The tag is
the timestamp, which is totally ordered among all the
events. A star simply receives events from its input
ports and generates events to its output ports. The process
of generating new output events can be divided into two 
parts: calculating the values and finding the timestamps. 
These two tasks are independent of each other, and can be 
done separately. So it is possible to use Ptolemy for behav- 
ior simulation and the ISS for timing. Furthermore, in the 
process of finding out the value of the event, the s-graph is 
executed from the BEGIN node to the proper EMIT node 
or END node. If all the predicates on this path are 
recorded, the internal path information is extracted. This 
unique internal path can then be used for caching timing 
information. 

4.3. Cached Timing Refinement Scheme 

A set of variables iTrace, one for each TEST node 
and termination point, are used to record the expression 
values on the TEST nodes. A table is created to store the 
timing information. The table has two fields; one of which, 
called the key, is encoded from iTrace; the other, called 
the delay, stores the delay value. During the firing of a star, 
the behavior model is first executed in Ptolemy. Then, the 
executed path of the s-graph is stored in iTrace. If this 
particular path has been traversed before, i.e. the same key 
is in the table, the corresponding delay is read and used as 
the delay for this execution. In this process, the ISS is not 
called. If there is no such key in the table, the ISS is called 
as described in section 3. The returned delay value is used 
for event delay calculation, and at the same time a new 
entry is added in the table. 

The advantage of using this scheme is that for each 
internal path, the external ISS is called only once. So the 
number of Ptolemy-ISS communications is significantly 
reduced. It is obvious that this scheme is only appropriate 
for s-graphs with approximately constant delay for each 
path. 
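
The table lookup described above can be sketched in a few lines of Python. This is an illustration only: the class name DelayCache is hypothetical, the measure callable stands in for the full Ptolemy-wrapper-ISS round trip of section 3, and the cycle counts in the example are invented values, not measurements.

```python
class DelayCache:
    """Caches the measured delay of each (path, termination point) pair."""

    def __init__(self, measure):
        self.measure = measure   # stand-in for a call to the external ISS
        self.table = {}          # key -> delay in clock cycles
        self.iss_calls = 0

    def delay(self, trace, termination):
        key = (tuple(trace), termination)   # key encoded from the trace values
        if key not in self.table:
            self.iss_calls += 1             # path not seen before: ask the ISS
            self.table[key] = self.measure(trace, termination)
        return self.table[key]


# Invented delays for two paths of a hypothetical module.
def fake_iss(trace, termination):
    return 100 if termination == "O1" else 120


cache = DelayCache(fake_iss)
first = cache.delay([("t1", True)], "O1")
second = cache.delay([("t1", True)], "O1")    # served from the table
other = cache.delay([("t1", False)], "O2")
print(first, second, other, cache.iss_calls)
```

After the three firings above the stand-in ISS has been consulted twice, once per distinct path.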

4.4. Stochastic Analysis 

Due to the increasing complexity of the processors that 
are used for embedded applications, it is impossible to 
ignore the effects of caching and pipelining in the software 
part both in cosimulation and in static estimation.
Moreover, a single execution of each basic block of the program
is not sufficient to accurately characterize the block,
because it neglects the interaction with previous,
subsequent and preempting basic blocks.

Although stochastic measures are considered unaccept- 
able for hard real-time embedded systems, we believe that 
applying a statistical analysis on the results obtained from 
linking Ptolemy with the ISS at an early design stage could 
improve the accuracy of our high-level trade-off evaluation 
method by automatically stopping the slow cosimulation 
when the measured variance of the delay in a path goes 
below a certain threshold. Alternatively, this provides the 
user with a measure of which module is particularly hard to 
characterize (due to a high variance of measured delays), in 
order to decide to continue simulations with the ISS, or use 
only cached delays. 
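
One possible form of the variance-based stopping rule is sketched below in Python, using Welford's running-variance update. The class name, the min_samples guard and the sample delays are assumptions of this sketch; the text does not specify how the variance is computed or which threshold is used.

```python
class PathStatistics:
    """Running mean and sample variance of the measured delays of one path."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0   # sum of squared deviations from the running mean

    def add(self, delay):
        self.count += 1
        delta = delay - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (delay - self.mean)

    @property
    def variance(self):
        if self.count < 2:
            return float("inf")   # too few samples to judge the path
        return self._m2 / (self.count - 1)

    def is_stable(self, threshold, min_samples=5):
        """True when further ISS runs for this path could be skipped."""
        return self.count >= min_samples and self.variance <= threshold


stats = PathStatistics()
for cycles in [100, 102, 98, 100, 100]:   # invented delay samples
    stats.add(cycles)
print(stats.mean, stats.variance, stats.is_stable(threshold=4.0))
```

A path whose variance stays above the threshold would instead be reported to the user as hard to characterize.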

Sometimes applications such as multi-media use pro- 
cessor architectures with a sophisticated non-standard 
instruction set. At present, compilers are not able to
generate good code for that type of architecture and the user
must manually optimize the code. In the Polis approach,
this would result in an assembly code subroutine called 
inside Esterel modules. In this case, the modules cannot be 
directly simulated in the Ptolemy environment because the 
code cannot be compiled on a workstation. It is desirable to 
skip the execution of that function in Ptolemy and update, 
by using the ISS, not only the delay information but also 
the arithmetic/logic result of the computation. 

Both activities outlined are still at a very preliminary 
stage, but the results seem quite promising. 

5. Implementation 

A prototype of this scheme has been implemented by
using Ptolemy and the ST20 Toolset [11]. Two parameters
are added for each star to allow users to choose the
delay calculation method: that of [10], cached refinement,
or uncached refinement. A hash table is used to record
and retrieve the path delay. The key of the hash table is
encoded from the iTrace variables and the ID of an
output port. The typical flow of this implementation is shown
in Figure 2.





Figure 2. Implementation 



Simple benchmarks are tested using these three
approaches. COMPARE is a 2-input-2-output module
which compares two integers A and B. If A>B, it emits
output O1, otherwise it emits O2. ADDER is a floating
point adder with one input A, one internal parameter B,
and an output SUM. The results of using software
performance estimation and the ISS are shown in Table 2 and
Table 3 respectively, where dctA is the flag of detecting an
input A; A stands for the value of input A; output is the
termination point with 0 for END; estimated is the
clock cycles obtained from the method of [10]; and
measured is that from the ISS.

COMPARE is a module without function calls, where
compiler optimization makes the delay of an entire path
less than the sum of the delays of its nodes.

Table 2: COMPARE MODULE 
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Table 3: ADDER MODULE 
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Table 3 shows a module with two user defined
functions; the delay of these functions is obtained from the
statistics of sample runs with typical inputs. Also notice that
the delay of the same path varies with the input data values.

The experiments are done on a SPARC20 workstation.
For a simulation of 100 firings of COMPARE or ADDER,
not counting the time needed to start the ISS (approximately
3 seconds), the cached timing analysis is about 10 times
slower than that of [10], but 1000 times faster than the
non-cached exact timing analysis.



6. Conclusion 

In this paper, an ISS-based timing refinement scheme is 
studied in the context of the Polis codesign approach. By 
using the ISS, some intrinsic problems of software
performance estimation are solved. An open architecture is
presented to accommodate the differences among ISS's. Based on the
properties of the s-graph and the DE semantics, a cached 
timing refinement scheme is studied. The result is that for 
s-graphs with approximately constant delay paths, the 
delay of one execution can be stored and reused the next 
time when the same path is traversed. A prototype has been 
implemented by using the Ptolemy and the ST20 simula- 
tion tools. The result shows that the timing analysis scheme 
can be seamlessly integrated into the Polis environment. 
Abstract

Performance modeling and evaluation of embedded
hardware/software systems is important to help the
CoDesign process. The hardware/software partitioning
needs to be evaluated before synthesizing the solution. This
paper presents a co-simulation technique based on the use
of an uninterpreted model able to accurately represent the
behavior of the whole system. The performance model
includes two complementary viewpoints: the structural
viewpoint, which describes the functional structure, the
hardware structure and the functional to hardware mapping,
and the behavioral viewpoint, which specifies the temporal
evolution of each function or process. Attributes are added
to the graphical model to specify the local properties of all
components.

The performance properties of the solution are
obtained by simulation with VHDL. Software functions are
executed according to the availability of an execution
resource which simulates a microprocessor. With this technique,
many results can be obtained rapidly by modifying
appropriate parameters of the model, so that the
CoDesign space is easily scanned to decide on the best
implementation. This modeling and estimation technique is
fully integrated in a whole development process based on the
MCSE methodology.

1: Introduction

In CoDesign, one major problem concerns the 
performance evaluation during the design step. Indeed, 
designers first have to define the appropriate functional 
architecture and then to find the partitioning and the 
allocation on the selected hardware. This means that the 
solution is deduced from the required performance 
constraints. 

First of all, in order to correctly meet the design
objective, one needs to consider the whole development life
cycle and to base system developments on a complete
design model and methodology. The work presented here is
based on the use of the MCSE methodology [4] and
specifically on the benefits of the functional model. Then,
all along the design process, the selected solution has to be
verified and evaluated against functional and non-functional
requirements.

In order to avoid the late discovery of unmet performance
requirements, the objective of the CoDesign method is to
establish and maintain a strong link between the two
concurrent developments: hardware and software. The two
development branches result from the Hw/Sw partitioning.
Deciding on an appropriate partition is therefore essential.

In this paper, we describe an efficient technique to 
evaluate performance properties of embedded Hw/Sw 
systems in order to correctly decide on partitioning and 
allocation according to performance constraints. Section 2 
presents an important goal the designer is faced with. 
Section 3 describes the proposed CoDesign process. 
Section 4 briefly presents the uninterpreted performance 
model and the meaning of some attributes. Section 5 
describes the co-simulation technique we are developing. 
Through an example, Section 6 explains the use of this 
model to extract performance properties and decide on the 
partitioning. Conclusions are drawn in the last section. 

2: The partitioning goal in CoDesign 

The final quality of systems that designers develop is 
mostly dependent on the development process. The first 
step is concerned with customer requirements which are 
then translated into functional and non-functional 
specifications. Performance constraints are one important 
category of non-functional specifications for Hw/Sw 
systems. From the specifications, designers have to decide 
on an architecture able to satisfy the application 
functionalities and performance constraints. The 
partitioning and the allocation of functionalities onto 
components are decided on during the CoDesign step 
[11],[16]. The last steps concern the implementation, the 
unit tests, the integration of all the parts, the tests and 
certification of the whole system with its environment. 

Partitioning and allocation are strongly dependent on
non-functional constraints: performances, timing
constraints, cost, time-to-market, etc. One problem is to
correctly elicit these requirements with the customer.
Another problem, which we consider here, is how to decide on
the partitioning. As a matter of fact, designers have to estimate
and predict the performances of selected architecture(s) and
compare them with the requirements. Later, during the
synthesis of the solution which leads to the implementation
step, more accurate information is available to refine the
performance estimation and, if necessary, correct the design.

Performances qualify the behavior of the system
relative to observation criteria which may be external to
the system (response time, throughput, etc.) or internal
(utilization of a resource, bus throughput, etc.) [2],[7],[13].
Each kind of performance is called a performance index.
Here we are concerned with the dynamic performances of
real-time systems, which are the most difficult to estimate
and satisfy.

The estimation of system performances is usually done
by analytical methods or simulation techniques [13]. We
select simulation for its capability to model transients.
Two types of simulation models are then possible:
interpreted and uninterpreted. An uninterpreted model is a
model whose behavior does not depend on the data
values. An interpreted model, such as an algorithm or a
state-based diagram, does depend on them. Therefore,
an uninterpreted model is a more abstract model, or is an
abstraction of an interpreted model obtained by removing
the data or information values. The effects of these data
values are abstracted and replaced by attributes. For
example, the attribute Execution Time replaces the
execution duration of a sequence of statements on a
processor, and the attributes Size and Id replace the content
of a message. Few performance models and tools exist to
evaluate the dynamic performances of any kind of system
[1LH4].
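
As a minimal illustration of this abstraction (a sketch only, not
part of the MCSE tools), the Python fragment below replaces the
content of a message by the attributes Size and Id, and a sequence
of statements by a fixed execution time. The class and function
names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """Uninterpreted message: attributes only, no payload (hypothetical type)."""
    size: int   # attribute Size, in bytes
    ident: int  # attribute Id


def abstract_message(payload: bytes, ident: int) -> Transaction:
    """Drop the data values of an interpreted message and keep its attributes."""
    return Transaction(size=len(payload), ident=ident)


@dataclass(frozen=True)
class Operation:
    """Uninterpreted operation: the statements are replaced by their duration."""
    name: str
    execution_time_us: float  # attribute Execution Time


def total_time_us(operations):
    """Duration of a plain sequence of operations on one processor."""
    return sum(op.execution_time_us for op in operations)
```

The simulation then only manipulates such attribute records, which
is why its behavior no longer depends on the data values.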

In CoDesign, an accurate estimation of the temporal
properties of a solution needs to simulate the hardware part
and the software part together. Since a microprocessor is
used to run several processes or tasks, its properties and the
task-scheduling policy mainly define the global system
behavior. The time scale is not the same either: > 1 µs for
the software, < 100 ns for the hardware. Therefore a
co-simulation is necessary.

To help designers during the partitioning phase, we
propose to use a performance model which is an
uninterpreted model to represent the hardware and software
organization and behavior of the solution. Properties are
extracted by co-simulation of this model, which means the
simulation of the software and the hardware together. This
technique is much faster than using an interpreted model
and makes it easy to study the influence of some specific
parameters. Generic architectures may also be studied.

3: Presentation of the method 

To correctly master the partitioning and allocation, and
so find the most appropriate mapping of the functional
description onto a hardware architecture, we propose the
CoDesign process depicted in Figure 1. For
more details on the global design process, the reader can
refer to [4], [9]. In our approach, before partitioning, the
designer needs to correctly delimit the critical parts of
the project, to design a functional solution, to specify the
performance requirements and the system workload
conditions, and to define a detailed functional solution
including the geographic partitioning constraints and the
physical interfaces.

The CoDesign stage is decomposed into two phases. In
each one, a verification by co-simulation makes it possible to
decide whether to make corrections or to continue.



-Figure 1 - The CoDesign process with performance estimation. 

The partitioning and allocation can be based on various
methods: automatic, semi-automatic, interactive [16]. Since
the input functional description conforms to the MCSE
methodology, in [5],[9] we suggested following an
interactive coarse-grain partitioning procedure driven by the
designer, who can easily decide on an appropriate choice for
each function.

The uninterpreted performance model presented in the 
next section is then easy to obtain. The structural model 
results from the composition of the functional structure and 
the hardware architecture according to the mapping. The 
behavioral model of each function is an abstraction of the 
algorithmic behavior. Attributes and parameters specify the 
properties of all components. The workload of the system is 
used to define the context of the simulation. The 
performance indexes are used for selecting the results to 
observe. 

In the second phase, when an appropriate partition is 
reached, the functional description, the hardware 
architecture and the mapping are used to obtain the 
hardware and software descriptions by synthesis [12]. Both 
descriptions are used for a final verification by a detailed 
interpreted co-simulation. A back-annotation of execution 
times is also possible to enhance the performance model. 

This process makes it possible to follow a smooth incremental
design path with a better integration of performance
mastering. In this way the correction or improvement
feedback loop is shorter. As a matter of fact, without the
uninterpreted performance modeling and co-simulation
phase, the verification of performance satisfaction is
possible only after the complete synthesis of the solution
and a detailed co-simulation, which needs more time.

4: Presentation of the model 

The conceptual model of MCSE [4] includes two views, 
each corresponding to a specific aspect of the solution: 

- the functional model (hierarchical and graphical model) 
describes a system by a set of interacting functional 
elements (organizational dimension or functional 
structure) and the behavior of each of them. 

- the executive model describes the architectural structure 
based on active components (microprocessors, specific 
processors, analog and digital components) and 
interconnections between them. 

These two views, when separately considered, are not 
sufficient to completely describe the solution of Hw/Sw 
systems. It is necessary to add the mapping between the 
functional and the executive viewpoints, defining an 
integration or allocation also called configuration. 

The functional model, located between models
appropriate to express specifications and models to describe
the architecture, is suitable to represent the internal
organization of a system by explaining all necessary
functions and couplings between them according to the
problem viewpoint. Designing with this method leads to an
internal technology-independent solution. All or part of the
description may be implemented either in software or in
hardware. Therefore, this model is interesting as a
specification input for a Hw/Sw CoDesign method based on
a coarse-grain partitioning.

We enhanced this model according to two
complementary and orthogonal viewpoints to extend its
usefulness to performance modeling:

- the organizational viewpoint (structural model) which 
describes the system by a hierarchical structure 
including the above functional and executive 
structures. 

- the behavioral viewpoint for each function or 
component, which specifies the set of operations and 
their total or partial time ordering. This is an 
uninterpreted model of the function. 

In the next sections we briefly introduce the two 
viewpoints. More information on the performance model 
and its use can be found in [6], [10]. In [6] this model was 
used to analyze the performance properties of a real-time 
video server. 

4.1: The structural model 

The meaning of the structural model is extended by
considering both the functional meaning (function, event,
shared variable, port) and the executive meaning
(processor, signal, shared memory, communication node).
Thus, it is possible to represent both structures, functional
and executive, with the same graphical model, and so to
describe the complete architectural solution with the
partitioning and allocation (functional to executive
binding). Figure 2 illustrates the concept. On the left hand
side, two structures and the partitioning and allocation are
depicted. On the right hand side, only one structure
represents the same solution.



[Figure 2 panels: b) The composite structural model;
c) The structural model with Hw/Sw interfaces]

-Figure 2 - Structural model for performance modeling.

The example considered here is a simplified
communication system ComSystem for message transfer
between producers Prod[1:m] and consumers Cons[1:n].
The function Emission has to send each message of LReq to
its corresponding function Reception through the port
P_Send. To guarantee a correct transfer, each message has
to be acknowledged via the port P_Ack. Emission uses a
watchdog function to limit the waiting duration of the
acknowledgement.

The executive structure is composed of two processors
linked by a node representing a bus. Each processor can be
characterized by two attributes: 'Concurrency (the number
of functions it can execute simultaneously) and 'Power
(relative CPU speed value). The bus can be specified by its
concurrency (number of simultaneous accesses) and its send
and receive times for each message. Here the chosen
allocation is simple to understand since it is based on the
geographic partition constraint.

The objective of the performance model (structural
viewpoint) on the right hand side is to represent the two
structures and the allocation with only one model. Figure 2-b
depicts such a composite or combined structural model.
Starting from the functional structure, each processor P1
and P2 is added as an encapsulation of the set of functions
that each processor has to execute. This operation
corresponds to a graph restructuring which is called
folding: a group of nodes in a graph is selected to form a
new node composed of the subnodes previously selected. In
this way, P1 and P2 have in fact the meaning of a function,
with nevertheless the two specific properties of a processor,
'Concurrency and 'Power. In this structure the inter-processor
link through the node Bus is not considered, for
simplification. The temporal properties of the link are
abstracted and integrated into the 2 relations P_Send and
P_Ack. If the above structural model is considered too
abstract, it is possible to keep the Bus link. In that case,
interfaces (which are functions) between functions and
processors must necessarily be added (Figure 2-c).
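
The folding operation can be sketched in a few lines of Python.
This is only an illustration on a plain edge set: the function name
fold, the edge representation and the grouping of Prod and Emission
onto P1 are assumptions made for the example, not details given by
the method.

```python
def fold(edges, group, new_node):
    """Fold the nodes of `group` into a single node `new_node`.

    edges: set of (source, destination) pairs of the functional structure.
    Returns (members, internal, external):
      members  - mapping from the new node to the subnodes it encapsulates,
      internal - edges hidden inside the folded node,
      external - edges of the folded graph, with group members renamed.
    """
    group = set(group)

    def rename(node):
        return new_node if node in group else node

    internal = {(s, d) for s, d in edges if s in group and d in group}
    external = {(rename(s), rename(d)) for s, d in edges
                if not (s in group and d in group)}
    return {new_node: sorted(group)}, internal, external


# Simplified ComSystem relations (assumed layout, one producer and consumer).
edges = {
    ("Prod", "Emission"),       # via LReq
    ("Emission", "Reception"),  # via P_Send
    ("Reception", "Emission"),  # via P_Ack
    ("Reception", "Cons"),
}
members, internal, external = fold(edges, {"Prod", "Emission"}, "P1")
```

After folding, P1 appears as one function linked to Reception by
the two relations P_Send and P_Ack, while the relation between Prod
and Emission is kept inside P1.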

4.2: The behavioral model 

The behavioral viewpoint of an active component is 
orthogonal to the structural one. A behavioral model, which 
is hierarchical, graphical and uninterpreted, is described 
according to the vertical axis representing the temporal 
evolution and the horizontal axis describing the data or 
information flow (transactions). 

The temporal description is based on five kinds of
composition: sequence (&), alternative (|), concurrency (||),
iteration ({-}), conditional (guarded) activation ([?-!?-]).
Figure 3 gives the graphical notation for each of them
(vertical axis). Exclusive or concurrent evolutions are
drawn as parallel branches. Complex internal behaviors
result from dynamic instantiations of activities and activity
refinements. An activity considered elementary is called an
operation.



[Figure 3 panels: Sequence, Concurrency, Multiple concurrency,
Alternative, Iteration, Conditional activation]

-Figure 3 - Representation of composition rules.

The modeling of performance implies the notion of 
execution time for operations and exchanges in order to 
extract the global temporal properties of a system from its 
local temporal properties. Therefore, an execution time 
(attribute Time) is added to each operation. 

To express interaction rules between vertical branches,
temporal dependencies (i.e. synchronizations,
communications) and inputs and outputs are represented
horizontally. An activation condition is elaborated from
available inputs. The generation of outputs or internal data
are actions executed after operations. Composition
operators for interactions are represented in Figure 4.

[Figure 4 panels: Strict order, AND without order, Only one,
Selection]

-Figure 4 - Notation for function or activity interdependencies.



The actions concern the generation of an information
item or of an event through the outputs of the component
including the activity or towards other internal activities.
For a shared variable, the action concerns a reading or a
writing operation. The Alternative symbol produces
only one output. In this case, a rule (attribute) must be
specified to define the output concerned: deterministic or
random. The Sequence symbol defines the order of
reception or generation. The Selection symbol specifies
which input or which output in a set or in a vector is
concerned. The attribute 'Path is used for that purpose. The
behavioral model is illustrated by the example given further
in Section 6.

An information item or a transaction is defined with a 
set of predefined and user carried attributes. All attributes 
are useful to control the temporal evolution of functions, 
activities and operations of the whole structure. 

4.3: Attributes and parameters

To extract results from this uninterpreted performance 
model, attributes must be added to the above graphical 
notation. 

The predefined attributes of the structural model 
concern each active component and each relation 
component. For a function, a processor and even a system, 
(i.e. an active structural component), we have selected the 
following attributes: 

- 'Power: a floating-point value, (1 by default)

- 'Concurrency: a positive integer, (1)

- 'Policy: (PSP, PSD, NPS, TS), (PSP means Preemptive
Scheduling based on Priority)

- 'Overhead: a time, (0)

- 'Priority: an integer, (1 for the lowest priority)

- 'Deadline: a time, (0).

These attributes are useful to evaluate the properties of
an architecture. The attribute 'Power simulates the power of
a computer unit; execution times of all included active
elements are scaled by this coefficient. It simulates the clock
speed of the computer. The attribute 'Concurrency is useful
to simulate a component having a limitation of running
resources. With it, it is easy to simulate a monoprocessor or
multiprocessor type of component. By changing the
attribute 'Policy, it is also easy to experiment with the
influence of different policies and compare them. The
attribute 'Overhead is useful to simulate the time needed for
task context switching. The last two attributes 'Priority and
'Deadline are used to select the most urgent task to run (only
one of the two values is used according to 'Policy).
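
The attributes listed above can be gathered in a small record, as
in the hedged Python sketch below. The class name, the default
policy and the choice of dividing operation times by 'Power (so
that a larger value means a faster unit) are assumptions made for
illustration; the text only states that execution times are scaled
by this coefficient.

```python
from dataclasses import dataclass


@dataclass
class ActiveComponent:
    """Predefined attributes of a function, processor or system (sketch)."""
    power: float = 1.0     # 'Power, 1 by default
    concurrency: int = 1   # 'Concurrency, 1 by default
    policy: str = "PSP"    # 'Policy in {PSP, PSD, NPS, TS}; default is assumed
    overhead: float = 0.0  # 'Overhead, a time
    priority: int = 1      # 'Priority, 1 is the lowest priority
    deadline: float = 0.0  # 'Deadline, a time


def effective_time(operation_time: float, component: ActiveComponent) -> float:
    """Execution time of an operation once scaled by the component's 'Power.

    Assumption: the time is divided by 'Power, i.e. doubling 'Power halves
    the execution time of every included operation.
    """
    return operation_time / component.power
```

With this convention, changing a single attribute value is enough
to compare a slower and a faster execution unit in simulation.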

Each kind of relation component is also specified by a
set of attributes. Here, due to lack of space, we only give the
selected attributes of a port:

- 'Policy: (Fifo, Priority), (Fifo)

- 'Concurrency: a positive integer, (1)

- 'Capacity: max number of messages, ≥0, (1)

- 'Write: a time, (0)

- 'Read: a time, (0).

The predefined attributes of the behavioral model are:
'Time for operation durations, 'Size for the size of data or
information items, 'Path to specify a path through a
selection operator, 'Cond for a conditional loop, 'Id for the
identification of a function or an activity.

In general, the value of an attribute is dynamic and is 
defined by any mathematical expression including constant 
values, parameters, other attributes, the current time, 
mathematical and probabilistic functions. 

The resulting model is an uninterpreted one. Notice also 
that several models may specify an active component at the 
same time. This means that during the top-down design 
process the behavioral model is a specification from which 
it is easy to deduce an equivalent structural solution. 

5: Co-simulation and result extraction 

Our performance modeling technique and the
corresponding simulation method are an integral part of a set
of tools we have been developing in support of the MCSE
methodology. The performance model has to be simulated
to be usable for CoDesign as a help to decide on partitioning
and allocation. Two techniques are possible: use of a
specific simulator developed for the proposed model, or
translation of the model into a language for which a
simulator already exists. In this latter case, the model is
translated into an executable description. We are currently
considering two techniques: translation into VHDL and then
simulation to extract appropriate characteristics, or translation
into C++ and execution. The process under development to
evaluate performance is depicted in Figure 5.

[Figure 5 labels: Method under development; System modeling;
Model capture; Graphical model; Translation (translation rules);
Simulatable VHDL program; Simulation in VHDL; Interpretation of
results. Inputs: attribute definition; parameters, workload]

-Figure 5 - Process for performance evaluation.

Designers first have to define the appropriate
performance model according to what they want to
evaluate. The model is captured with graphical tools and the
attributes of all the elements are added. The graphical model
is then automatically translated into a simulatable VHDL
program according to translation rules. The simulation of the
VHDL model, with defined parameters and an appropriate
simulation of the workload of the system, generates events
and data which are interpreted to obtain the results.

The performance analysis of the event trace leads to 
estimate the properties of the solution during the design step 
and to select the best solution and parameters [7]. 

VHDL is very efficient to describe and simulate
concurrent functions and multiple instantiations with
generic parameters. The simulation makes it possible to
extract various characteristics of architectures in order to
evaluate their costs and performances. High-level descriptions
are also very easy to describe and test in the form of
uninterpreted models. Generic parameters are an efficient way
to specify the behavior of all types of components of the model.

But for hardware/software co-simulation, we have
observed some limitations due to the fact that VHDL was
conceived to describe circuits rather than systems.
Probably the main limitation of VHDL for our goal is the
lack of external process suspension, including freezing
the process time, which is needed to simulate the sharing of a
processor among multiple functions. Further details on the
translation rules into VHDL are given in [6], more
specifically on the execution of several functions on a
limited processing resource.

6: An illustrative example 

The case study we have chosen to illustrate our
approach is described in [5],[9]. The required goal is to
design and prototype a distributed communication system
obtained by assembling many similar boards. On each
board, producers have to send short messages or packets
(256 bytes max.) to consumers located on the same board
or on other boards. Producers and consumers are software
tasks. A 20 Mbits/s serial bus called TransBus [3] is used
to interconnect the boards. The system requirements and
the bus specification are shown in Figure 6. Each message
includes: the address of the consumer, the length of the data
part and then the data.




-Figure 6 - Requirements of the communication system. 

The bus access management is based on a hardware 
token ring. At any moment, only one board must own the 
token. The token is implemented as a boolean signal and all 
the boards are wired as a circular shift register. Only the 
token owner can send a message if needed and then pass on 
the token to its neighbor. 
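
The access rule can be illustrated by a short Python sketch of the
token circulation. It is a simplified model under stated
assumptions: the function name is hypothetical, boards are numbered
from 0, and the owner sends at most one message per token visit
before passing the token on.

```python
def circulate(k, pending, owner=0):
    """Return the order in which boards transmit on the shared bus.

    k       - number of boards wired as a ring,
    pending - pending[i] is the number of messages waiting on board i,
    owner   - index of the board that initially owns the token.
    Only the token owner may send (one message per visit, by assumption),
    then the token moves to the next board of the ring.
    """
    pending = list(pending)  # work on a copy
    order = []
    while any(pending):
        if pending[owner] > 0:
            pending[owner] -= 1
            order.append(owner)
        owner = (owner + 1) % k  # pass the token to the neighbor
    return order


# Three boards: board 0 has one message, board 2 has two.
order = circulate(3, [1, 0, 2])
```

In this example board 2 has to wait for a full turn of the token
before sending its second message.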

6.1: The problem to be solved 

The designer's objective is to correctly define and
implement a board according to performance requirements.
A generic architecture is quite simple to imagine. In [9], we
described an architecture based on a microprocessor, an
FPGA and a shared memory. It is easy here to find that a
strict minimum of hardware is necessary to satisfy the 20
Mbits/s transmission rate and the bit protocol imposed by
the TransBus. In the rest of the paper, we suppose the
existence of this hardware part to implement the bus
interface. It includes a parallel to serial converter to transmit
each byte and the reverse for the reception of each byte.

The problem here is to determine the remainder of the 
solution. The first step consists in defining the functional 
solution. This means identifying all processes and relations 
between them. The next step consists in defining the 
partitioning and the allocation. But to do so, it is necessary 
to have quantitative information on the required 
performances and on the performances estimated according 
to the selected functional design and generic hardware 
architecture. To obtain this information, a model of the 
solution is needed. 

Rather than developing a complete interpreted model for
both the software part and the hardware part and co-simulating
it, we show in this example that it is relatively easy to
estimate various performance indexes on different
implementations with a generic uninterpreted model and a
co-simulation of it. In the next sections, we describe the
functional model, the behavioral model, and the various results
obtained by co-simulating the hardware and the software at
a macroscopic level but sufficiently detailed to rapidly
observe interesting results.

6.2: The functional model 

To design this communicating system, it is necessary to
take into account the decomposition of the system into a set
of boards and the interconnection bus (the geographical
distribution of the application). This task is well done by
applying the specification and functional design steps of the
MCSE methodology. The result of geographic partitioning
and introducing the physical interface is described in
Figure 7-a which presents the complete detailed functional
solution of each board which satisfies these technological
constraints. The TransBus is here modelled (abstracted) in
Figure 7-b by a vector of events Token[1:k] to represent the
token ring and a vector of ports TB[1:k] to describe the
behavior of the message transfer between each pair of boards.

Each message produced is sent by a producer Prod[i] to
the function Routing through its port Treq[i]. The address
field is used by Routing to determine if the designated
consumer is local (same board) or distant. For each distant
communication, the function EmissionMess sends each
message from Lreq to the addressee board through the port
TB[addressee]. Since only one board at a time
may access the TransBus, EmissionMess first has to
request the token (event EmisReq) and wait for it (event
TokenOk). When the message sending is finished, the event
EmisEnd releases the token which is then sent to its
neighbor (Token[(i+1) mod K]). The function ReceptionMess
receives each message which concerns the board and sends
it to the function Dmux through the port Lind. Dmux sends
each message to the addressee consumer.

6.3: The behavioral model 

The behavior of each function in an uninterpreted form
is given in Figure 7-c according to the notation described in
Section 4. Attributes are added to the graphical model to
specify the complete behavior. To understand the notations,
Treq[] means the port of the same index as the producer in
the vector Treq[1:n]. Treq[:] designates the complete
vector. This notation is interesting for generic components.

Producers and Consumers are very simple cyclic 
processes. The size of each produced message is a random 
value (function Uniform). The time interval between two 
successive messages is defined by the execution time of the 
operation Tprod (attribute Time). This is an easy way to 
specify the system workload for this example. 

The function Routing receives messages from all ports
in Treq[1:n] and simulates the routing of a part of them to
the port Lcons (local messages). The attributes Proba and *
(which means Else) are used to specify the selected output.
Each generated message includes the attribute 'Size of the
input message. Dmux receives messages from the ports
Lcons and Lind and routes them to the consumer whose
identifier is defined by the included attribute 'Id.

The function TokenManagement is in charge of allocating
the bus to only one board at a time. EmissionMess receives a
message from Lreq and then asks for the token. The
transmission of each message on TransBus is modelled by
sending it to the port of TB[:] whose index is equal to the
attribute 'Id defined as a random index of a board. The
transmission time is specified by the attribute 'Write which
uses the size coming from the received message and the
parameter A defining the time to transmit each byte. The
function ReceptionMess is a cyclic process waiting for each
message in its port TB[] and then sending it to the port Lind.

6.4: Co-simulation and results 

The objective of this example is to show that the model
described above makes it possible to evaluate the main
performances of different implementations of the functional
solution. To enhance the CoDesign approach described in [5],
for our example, we have experimented with 3 different
implementations. In Figure 8, the functional description is
decomposed into 3 areas: area (1) includes functions
compulsorily implemented in software, area (2) includes
the two transmission and reception functions on the
TransBus, and area (3) includes only the management of the
token.

The question is: how are the performances modified
when the functions in areas (2) and (3) are mapped onto
hardware or software? To answer this question, we have
simulated the uninterpreted model described above.
Beforehand, it is important to correctly identify the system
workload and the appropriate results expected in order to
decide on the best implementation.

Concerning the system workload, we have stimulated
the communication part of the system (emission and
reception on the TransBus) with producers permanently
sending messages of random size to distant random boards.
In this case, Tprod'Time = 0 ms and in the function Routing,
Proba = 0 (no local transmission). A consumer is also
supposed to spend at least 1 ms to exploit each input
message. The servo-control of producers to consumers is
obtained by the capacity of each port (attribute 'Capacity).
We have chosen: capacity of Treq[i] and of Tcons[i] = 1,
capacity of Lreq and of Lind = 5. The correct simulation of
TransBus is obtained with TB[:]'Capacity = 0, which
means a rendez-vous between the sender EmissionMess
and the receiver ReceptionMess connected to the port used.

Concerning the results to evaluate, we consider the 
following as representative of the efficiency of the 
communication system: 

- the latency of a message from the producer to the 
consumer, 

- the throughput on the TransBus, 

- the utilization ratio of the processor running the 
software on each board. 

Because of the random behavior of the model, the 3 results 
are evaluated as the average over all boards and all 
producers, once the steady state is reached (≈ 0.1 s, as 
observed by simulation). 
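
As an illustration of how the three results could be extracted 
from a simulation trace, the following Python sketch computes 
them for one board while discarding the transient phase. The 
trace format (tuples of production time, consumption time and 
size, plus CPU busy intervals) and the function name are 
assumptions made for this example; averaging over boards would 
be done on top of it. 

```python
def steady_state_metrics(messages, busy_intervals, t_end, warmup=0.1):
    """Compute (mean latency, throughput, CPU utilization) after warm-up.

    messages:       list of (t_produced, t_consumed, size_bytes), times in s
    busy_intervals: list of (t_start, t_stop) during which the CPU is busy
    t_end:          end of the observation window, in s
    warmup:         duration of the transient phase to discard, in s
    """
    window = t_end - warmup
    # Keep only messages produced and consumed inside the steady-state window.
    kept = [m for m in messages if m[0] >= warmup and m[1] <= t_end]
    if not kept:
        raise ValueError("no message observed in the steady-state window")

    latency = sum(consumed - produced for produced, consumed, _ in kept) / len(kept)
    throughput = sum(size for _, _, size in kept) / window  # bytes per second

    # Clip each busy interval to the window before summing it.
    busy = sum(
        max(0.0, min(stop, t_end) - max(start, warmup))
        for start, stop in busy_intervals
    )
    utilization = busy / window
    return latency, throughput, utilization
```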

The value of the co-simulation lies in studying the 
influence of the different generic parameters of the system. 
Therefore, we have varied the number of boards (3, 6, 9) 
and the number of producers and consumers (generic 
parameter n). 

-A- Maximum hardware (areas (2) and (3) in hardware) 

To obtain appropriate results, it is necessary to know 
the time needed to send each byte on TransBus when 
EmissionMess and ReceptionMess are implemented in 
hardware. The value is taken from [5], where we described 
the solution. Another direct way is to consider the 
TransBus protocol at the bit level: 11 bits x 50 ns ≈ 0.7 µs. 
Therefore we have chosen A = 0.7 µs to represent the 
speed of the hardware. The time for TokenManagement is 
set to 1 µs. 

All software functions of a board are implemented on 
the same processor. To do that, a function named 
Processor is added, which includes all the software 
functions. This function simulates a resource with a 
concurrency degree of 1, which means that only one 
included function can be active at any given time. 
Two attributes of interest define such a function: its 
concurrency and its power (equal to 1 here). 'Power 
makes it possible to modify the execution speed of all 
software functions and to study its influence. The scheduling 
policy must also be defined. The attribute 'Priority of each 
software function is used for that purpose. Here the priority 
is highest for Dmux, then Routing, then Cons[1:n], and 
lowest for Prod[1:n]. 
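
The behavior of such a shared resource can be sketched in a few 
lines of Python. This is only an illustration under stated 
assumptions, not the VHDL model: the function run_processor and 
its job format are hypothetical, scheduling is assumed to be 
non-preemptive, and 'Power is assumed to divide the nominal 
execution time of every function. 

```python
import heapq


def run_processor(jobs, priority, power=1.0):
    """Simulate a resource with a concurrency degree of 1.

    jobs:     list of (release_time, function_name, nominal_duration)
    priority: dict function_name -> rank (lower rank = higher priority)
    power:    speed factor; execution time = nominal_duration / power
    Returns a list of (function_name, start, finish) in execution order.
    """
    pending = sorted(jobs)   # ordered by release time
    ready = []               # heap ordered by priority rank
    schedule = []
    now = 0.0
    nxt = 0                  # index of the next job not yet released
    seq = 0                  # tie-breaker keeping heap entries comparable

    while nxt < len(pending) or ready:
        # Release every job that has arrived by the current time.
        while nxt < len(pending) and pending[nxt][0] <= now:
            release, name, duration = pending[nxt]
            heapq.heappush(ready, (priority[name], release, seq, name, duration))
            seq += 1
            nxt += 1
        if not ready:
            now = pending[nxt][0]   # processor idle until the next release
            continue
        # Only one function is active at a time: run it to completion.
        _, _, _, name, duration = heapq.heappop(ready)
        start = now
        now = start + duration / power
        schedule.append((name, start, now))
    return schedule


# Priority order taken from the text: Dmux, Routing, Cons, Prod.
PRIORITY = {"Dmux": 0, "Routing": 1, "Cons": 2, "Prod": 3}
```

Running the same job list with power=1.0 and power=2.0 shows how 
the attribute scales all software execution times while leaving 
the priority order unchanged. 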

The results are given in Figure 8. K identifies the 
number of boards. 
- Figure 8 - Results for the maximum in hardware. 



The message throughput on each board is constant and 
equal to the average size of all the produced messages 
(129), each being consumed every 1 ms by a consumer. The 
bus throughput does not depend on n but on K. The 
message latency increases with the number of producers 
and consumers because each CPU is shared by all of them. 
The CPU utilization rate is relatively low for K = 3, because 
the bus is not properly used. 

-B- All functions in software 

All the functions are placed inside the function 
Processor. The time needed to send each byte on TransBus 
is now A = 7 µs. The execution time chosen for the operation 
SToken is 20 µs, which is the time a CPU needs on 
receiving an interrupt. The results are given in Figure 9. 



[Figure 9 panels: message latency, bus throughput, CPU utilization ratio] 
- Figure 9 - Results for all functions in software. 

The bus throughput is now constant and a little lower 
than in the previous case because 20 µs are added between 
two successive messages. The latency and the CPU 
utilization rate are similar. 

7: Conclusion 

In this paper, we have described a performance model 
and a co-simulation technique to help designers with system 
partitioning and allocation while developing embedded 
hardware/software systems. The model is of the uninterpreted 
type: it represents the whole solution at an abstract level, 
yet accurately enough to evaluate the system properties. 
Because the model is uninterpreted, its simulation is faster 
than that of a complete interpreted hardware/software model. 
The performance evaluation is based on a VHDL simulation; 
the VHDL program is obtained by a systematic translation 
of the graphical performance model and its attributes 
according to specific translation rules. 

A graphical tool and the automatic VHDL program 
generator are under development. We are also experimenting 
with a translation into C++ to obtain the performance 
results. The resulting method and the tool we are currently 
developing are fully integrated in the complete MCSE 
system-level methodology. 
Context 

Reuse of Intellectual Property (IP), or Virtual 
Components (VCs), from different internal and 
external sources in Systems-on-Chip allows 
companies to focus their R&D on their own core 
competencies, and to make effective use of other 
companies' specialized expertise for other parts. 
Such a model can only work if the 
microelectronics system industry worldwide can 
establish a unified vision with a set of open 
technical standards. This view is quite similar to 
design practices at the board level today. However, 
the complexity of future systems-on-chip will 
largely exceed what we currently know at the 
board level. Moreover, prototypes require costly silicon 
runs, fewer signals are visible for probing, fewer 
debugging facilities are available, and it will be 
much more difficult to analyze possible problems 
when combining several components. Therefore, 
these virtual components need specific models to 
analyze, compare, debug and validate complete 
system chips and all their interfaces before 
processing the real silicon, starting already in 
the early design phases. This is what is meant today 
by 'Virtual Prototyping'. 

Virtual prototyping of complete hardware (HW) - 
software (SW) systems is really key, but needs to be 
raised to much higher levels of abstraction than 
today's design practices, which are usually at the 
level of synthesizable RTL for custom hardware or 
Instruction Set Simulator (ISS) for programmable 
processors. This shift will result in totally new 
system level design environments to capture 
requirements, to specify functionality and 
architectures, to explore different mappings and 
schedulings, and to select and encapsulate reusable 
Virtual Components. To be used at 'system level', 
Virtual Components require several abstract 
models, expressing e.g. the performance, 
functionality or cycle-true behaviour. 

It is exactly the goal of the System Level Design 
(SLD) Development Working Group (DWG) of the 
VSI Alliance to specify standard interfaces, 
standard data formats and standard methods that 
will help system designers to explore, debug and 
verify complex system architectures consisting of 
several Virtual Components by virtual prototyping 
at multiple levels of abstraction. 

Discussion topics 

The discussion should be organized around the 
main achievements of the SLD DWG of VSIA. 
These include the following areas in the domain of 
HW/SW co-design and Virtual Prototyping: 

Standard nomenclature and VC model taxonomy: 
Progress to accelerate the encapsulation of Virtual 
Components (VC) in co-design has hit roadblocks 
because of a wide diversity of model terminology in 
use among VC providers, VC integrators, 
designers, semiconductor companies, system houses 
and EDA companies. First experiences of 
integrating third-party components in HW/SW 
systems by several system companies have shown 
that different terminology has already created a lot 
of confusion among the participants. Some 
organizations use many common modeling terms 
with divergent meanings, while others use different 
words to describe the same type of models. While 
this confusion persists, and the electronics 
community lacks a common language, different 
teams will be unable to effectively communicate 
and share models. Therefore, the SLD DWG has 
undertaken an effort to develop a nomenclature and 
modeling taxonomy, which will become a common 
language to describe models and their attributes for 
the VSI membership and the electronics design 
community at large. It contains a classification of 
system, architectural hardware, and software 
models, and is publicly available as a standard 
reference document. It basically modified and 
augmented previously defined terminology sets, 
broadened parochial definitions, distinguished 
overlapping definitions, equated close synonyms, 
removed inapplicable terms, added new terms, 
clarified poorly defined or misunderstood ones, and 
suggested new wording as replacements or 
synonyms to outdated ones. When appropriate 
existing definitions were lacking, the SLD DWG 
created them. Further evolution in design practices 
for VC integration into System Chips will identify 
the critical model types. Minimizing the number of 
models will reduce the effort required to produce a 
complete design package for a VC. 

Standard Interface Behaviour Description: The 
design methodology promoted by the SLD DWG is 
based on a clear separation of the VC functionality 
and the VC interface. Therefore, the DWG is 
establishing a well-defined hierarchical and 
multi-level standard description of VC interfaces, 
covering all abstractions from high level 
transactions down to detailed timed component 
protocols, implemented in hardware and/or 
software. A clean separation of interface properties 
from VC functionality and behavior, and a clear 
linking of interface abstractions at the system level, 
is a significant step towards the achievement of 
such a goal. A technique for specifying such 
interfaces will improve VC understanding and 
utilization. It will reduce the time required to 
understand behaviors and interfaces correctly. 
Gaining a faster understanding of VC operational 
principles allows a system architect to explore many 
more options before committing to the design 
phase. This faster VC interpretation and 
model-integration gives more comfort in the exploration of 
non-legacy architectures and so will open 
co-designs to the third-party market. Furthermore, a 
complete definition of the interface abstraction 
hierarchy allows designers, architects and SW 
authors to work within their preferred area of 
expertise (e.g. embedded-software, RTL, etc.), but 
still gives them the ability to effectively 
communicate with the different levels (e.g. unified 
test-benches/test-results can be applied to any view 
of the design). In addition, the standard Interface 
Behavior description will improve VC model 
supply and generation, VC integration in HW/SW 
co-design, and last but not least, VC protection. 



Standards for Behavioural Modeling of Virtual 
Prototypes: For the descriptions of the VC 
functionality itself, the SLD DWG is aiming for a 
standard library of data types that are commonly 
used in behavioral models for virtual prototyping 
(e.g. in C or C++). Today, system companies are 
using multiple libraries, even within the same 
company. Third party VC vendors are offering 
models with specific libraries that are not compatible 
with other environments. Different syntax and/or 
semantics make true exchange of models 
impossible. When the high-level models used in virtual 
prototyping of HW/SW co-design systems can all use 
the same data type libraries, the interoperability 
of these models will be largely enhanced, and 
co-design will become much more user-friendly. 

Performance Modeling Standard for HW/SW 
systems: This exploration began as part of a 
recognition that systems containing VCs require 
high level modeling techniques in order to 
efficiently evaluate the system performance of 
interconnected VCs (microprocessors, DSPs, 
memories, caches, buses, RTOS, etc). The specific 
intent of this specification is to describe the basic 
functional and interface requirements of 
system-level performance models for the most common 
types of Virtual Components. Performance Models 
are often the most abstract models of Virtual 
Components that are used during system design. 
They describe the system tasks as well as the 
resources together with the abstract and physical 
communication channels. Each of these elements 
can be modeled in terms of its basic processing and 
communication capabilities, such as the rate of 
processing, the latency of operations, etc. In 
contrast to functional models, abstract performance 
models do not compute the results of the operations. 
System-Level Performance models are used to 
explore different alternative mappings of the system 
tasks on selected resource architectures, and to 
perform trade-off analysis among different 
hardware and software architectures. They allow the 
system designer to obtain a first measurement of the 
quality of the design, to check if the proposed 
architecture will satisfy the overall performance 
requirements and meet the constraints, and to 
identify possible bottlenecks or over-sized parts. 
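
To make the idea of an abstract performance model concrete, the 
following Python sketch evaluates alternative mappings of tasks 
onto resources using processing rates only, without computing 
any functional result. It is an illustrative assumption, not part 
of any VSIA specification: the function evaluate_mapping, the 
task and resource names, and the numeric rates are all 
hypothetical. 

```python
def evaluate_mapping(tasks, resources, mapping):
    """Evaluate one task-to-resource mapping from rates alone.

    tasks:     dict task -> required operations per second
    resources: dict resource -> available operations per second
    mapping:   dict task -> resource
    Returns (utilization per resource, bottleneck resource, feasible flag).
    """
    load = {name: 0.0 for name in resources}
    for task, ops_per_s in tasks.items():
        load[mapping[task]] += ops_per_s

    utilization = {name: load[name] / resources[name] for name in resources}
    bottleneck = max(utilization, key=utilization.get)  # most loaded resource
    feasible = all(u <= 1.0 for u in utilization.values())
    return utilization, bottleneck, feasible


# Hypothetical workload and architecture (values chosen for illustration).
tasks = {"filter": 40e6, "decode": 30e6, "ui": 5e6}
resources = {"cpu": 50e6, "dsp": 100e6}

all_on_cpu = {"filter": "cpu", "decode": "cpu", "ui": "cpu"}
filter_on_dsp = {"filter": "dsp", "decode": "cpu", "ui": "cpu"}

for candidate in (all_on_cpu, filter_on_dsp):
    print(evaluate_mapping(tasks, resources, candidate))
```

Comparing the two candidates shows the kind of trade-off analysis 
described above: the first mapping overloads the CPU, while 
moving one task to the DSP leaves both resources with headroom. 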